Credit Risk Modeling¶

A project by Dr.-Eng. Khalifa Mejbri, expert in engineering, data science, and machine learning.¶

Part 1. Data Preprocessing and Feature Engineering¶

The first phase of this project is dedicated to rigorous data preprocessing and feature engineering, which are critical steps to ensure the development of a robust and reliable credit risk prediction model. This stage involves transforming raw data into a clean and structured format, ready for effective modeling. The dataset used in this project originates from Lending Club’s loan data, covering a wide range of borrower and loan attributes collected between 2007 and 2018.

The data processing workflow begins with importing and inspecting the dataset to identify quality issues, inconsistencies, and missing values. An initial analysis of missing data patterns is performed, and features with extremely high proportions of missing values—deemed to contribute little or no value to modeling—are systematically dropped. For remaining variables with manageable levels of missingness, appropriate imputation strategies are applied to preserve the integrity of the dataset.

A crucial early step is the construction of the target variable, which represents the credit outcome (e.g., default or non-default) and is derived from relevant loan status fields. Once defined, the dataset is split into training and testing sets to ensure that model evaluation is performed on unseen data, thereby promoting generalizability and reducing overfitting.

Next, categorical variables are identified and encoded. Discrete features with meaningful class distributions are transformed using categorical encoding techniques that retain interpretability and predictive power. Simultaneously, continuous numerical variables are examined for multicollinearity using Variance Inflation Factor (VIF) analysis. Highly collinear features are removed to reduce redundancy and prevent unstable coefficient estimation in downstream modeling.
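As an illustration of the VIF screen described above, the following self-contained sketch computes VIF by hand (VIF_i = 1 / (1 − R²_i), regressing each column on all the others plus an intercept). The toy dataframe, the `vif_table` helper, and the values are illustrative, not part of the project's pipeline:

```python
import numpy as np
import pandas as pd

def vif_table(df):
    # VIF_i = 1 / (1 - R^2_i), where R^2_i comes from regressing column i
    # on all the other columns plus an intercept.
    X = df.to_numpy(dtype=float)
    n, k = X.shape
    vifs = {}
    for i in range(k):
        y = X[:, i]
        others = np.column_stack([np.ones(n), np.delete(X, i, axis=1)])
        beta, *_ = np.linalg.lstsq(others, y, rcond=None)
        r2 = 1.0 - ((y - others @ beta) ** 2).sum() / ((y - y.mean()) ** 2).sum()
        vifs[df.columns[i]] = 1.0 / max(1.0 - r2, 1e-12)
    return pd.Series(vifs, name='VIF')

# Toy example: 'b' is an almost exact linear copy of 'a', so both get huge
# VIFs, while the independent column 'c' stays close to 1.
rng = np.random.default_rng(0)
a = rng.normal(size=500)
demo = pd.DataFrame({'a': a,
                     'b': a + rng.normal(scale=0.01, size=500),
                     'c': rng.normal(size=500)})
print(vif_table(demo).round(1))
```

Columns whose VIF exceeds a chosen cutoff (10 is a common rule of thumb) would be candidates for removal.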

Following this, continuous variables are categorized using Weight of Evidence (WoE) binning—a technique that aligns with the monotonic relationship between predictors and the binary target variable. This also enables the calculation of Information Value (IV), which helps assess each feature’s predictive strength.
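A minimal sketch of the WoE/IV computation just described, on a hypothetical already-binned feature (the bin labels, default counts, and the `woe_iv` helper are made up for illustration):

```python
import numpy as np
import pandas as pd

def woe_iv(binned, target):
    # binned: a binned/categorical feature; target: 1 = default ('bad'), 0 = 'good'.
    grp = (pd.DataFrame({'bin': binned, 'bad': target})
             .groupby('bin')['bad'].agg(total='count', bad='sum'))
    grp['good'] = grp['total'] - grp['bad']
    # Share of all goods / all bads falling into each bin (epsilon avoids log(0)).
    pct_good = (grp['good'] / grp['good'].sum()).clip(lower=1e-6)
    pct_bad = (grp['bad'] / grp['bad'].sum()).clip(lower=1e-6)
    grp['WoE'] = np.log(pct_good / pct_bad)
    iv = ((pct_good - pct_bad) * grp['WoE']).sum()
    return grp, iv

# Toy bins with a monotonically increasing default rate (5% -> 15% -> 30%),
# so WoE decreases monotonically from 'low' to 'high'.
bins = pd.Series(['low'] * 100 + ['mid'] * 100 + ['high'] * 100)
bad = pd.Series([0] * 95 + [1] * 5 + [0] * 85 + [1] * 15 + [0] * 70 + [1] * 30)
table, iv = woe_iv(bins, bad)
```

By the usual rule of thumb, an IV above roughly 0.3 indicates a strong predictor, below roughly 0.02 a useless one.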

The final part of preprocessing includes extensive feature engineering, which involves constructing new variables, aggregating existing information, and deriving ratios or interaction terms that better capture the financial behavior of borrowers. These transformations are guided by both domain knowledge and data-driven insights.
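As a hedged illustration of the kind of derived ratios meant here, the sketch below builds a payment-burden ratio and a revolving-utilisation ratio on a toy frame; the derived column names (`pymnt_to_income`, `revol_utilisation`) are hypothetical, while the input columns follow the Lending Club schema:

```python
import numpy as np
import pandas as pd

# Toy frame with Lending Club style input columns.
demo = pd.DataFrame({'installment': [123.03, 820.28],
                     'annual_inc': [55000.0, 65000.0],
                     'revol_bal': [2765.0, 21470.0],
                     'total_rev_hi_lim': [9300.0, 111800.0]})

# Annualised payment burden: 12 monthly installments relative to annual income.
demo['pymnt_to_income'] = demo['installment'] * 12 / demo['annual_inc']

# Revolving utilisation; guard against division by a zero credit limit.
demo['revol_utilisation'] = demo['revol_bal'] / demo['total_rev_hi_lim'].replace(0, np.nan)
```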

After completing these steps, the resulting dataset comprises 344 curated features, which form the foundation for building and training the credit risk model in the subsequent stages of the project.

I. General Data Preparation¶

Import Libraries¶

In [3]:
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
import matplotlib.pyplot as plt
import seaborn as sns

Dataset Description¶

The dataset used in this project contains detailed loan-level data from Lending Club, a prominent U.S. peer-to-peer lending platform. It covers accepted loan applications from 2007 to the fourth quarter of 2018, totaling more than 1 million individual loans (🔗 https://www.kaggle.com/datasets/thegurus/loan-data-accepted).

This dataset was obtained from Kaggle, and it includes a wide range of borrower and loan characteristics, such as:

  • Applicant financial data: income, employment length, debt-to-income ratio, etc.
  • Loan terms: loan amount, interest rate, installment, loan purpose.
  • Credit history: delinquency counts, public records, revolving balance, credit age.
  • Performance data: loan status, payment history, amount repaid, outstanding principal, and more.

To align with a real-world modeling scenario, we divided the dataset into two subsets:

  • A training set consisting of loan applications up to a given point in time, used for developing the Expected Loss (EL) components—Probability of Default (PD), Loss Given Default (LGD), and Exposure at Default (EAD).

  • A test set containing applications submitted after the PD model was trained, used to evaluate how well the model generalizes to new data.

This approach enables us to test the temporal robustness of the PD model and assess whether newer applicants exhibit similar characteristics to those in the historical training data.
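A minimal sketch of such a time-based split, assuming Lending Club's 'Mon-YYYY' format for `issue_d`; the `temporal_split` helper and the cutoff date are illustrative:

```python
import pandas as pd

def temporal_split(df, date_col, cutoff):
    # Train on loans issued up to the cutoff, test on everything after it,
    # mimicking deployment of the PD model on future applications.
    dates = pd.to_datetime(df[date_col], format='%b-%Y', errors='coerce')
    cutoff = pd.Timestamp(cutoff)
    return df[dates <= cutoff], df[dates > cutoff]

# Toy frame using Lending Club's 'Mon-YYYY' issue-date format.
demo = pd.DataFrame({'issue_d': ['Dec-2015', 'Jan-2017', 'Jun-2018'],
                     'loan_amnt': [3600, 12000, 8000]})
train, test = temporal_split(demo, 'issue_d', '2017-12-31')
```

Unlike a random `train_test_split`, this split never lets the model see applications that postdate the training window.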

In [4]:
# Import data (accepted and rejected loan applications).
loan_data_accepted = pd.read_csv('C:/Disc D/365DataScience/Credit risk modeling/Self_project/Data2/accepted_2007_to_2018Q4.csv',low_memory=False)
loan_data_rejected = pd.read_csv('C:/Disc D/365DataScience/Credit risk modeling/Self_project/Data2/rejected_2007_to_2018Q4.csv',low_memory=False)
In [5]:
# Copy the dataframe.
loan_data_accepted = loan_data_accepted.copy()
loan_data_rejected = loan_data_rejected.copy()
In [6]:
# Display the first 5 rows of the accepted loans.
pd.options.display.max_columns = None
loan_data_accepted.head()
Out[6]:
id member_id loan_amnt funded_amnt funded_amnt_inv term int_rate installment grade sub_grade emp_title emp_length home_ownership annual_inc verification_status issue_d loan_status pymnt_plan url desc purpose title zip_code addr_state dti delinq_2yrs earliest_cr_line fico_range_low fico_range_high inq_last_6mths mths_since_last_delinq mths_since_last_record open_acc pub_rec revol_bal revol_util total_acc initial_list_status out_prncp out_prncp_inv total_pymnt total_pymnt_inv total_rec_prncp total_rec_int total_rec_late_fee recoveries collection_recovery_fee last_pymnt_d last_pymnt_amnt next_pymnt_d last_credit_pull_d last_fico_range_high last_fico_range_low collections_12_mths_ex_med mths_since_last_major_derog policy_code application_type annual_inc_joint dti_joint verification_status_joint acc_now_delinq tot_coll_amt tot_cur_bal open_acc_6m open_act_il open_il_12m open_il_24m mths_since_rcnt_il total_bal_il il_util open_rv_12m open_rv_24m max_bal_bc all_util total_rev_hi_lim inq_fi total_cu_tl inq_last_12m acc_open_past_24mths avg_cur_bal bc_open_to_buy bc_util chargeoff_within_12_mths delinq_amnt mo_sin_old_il_acct mo_sin_old_rev_tl_op mo_sin_rcnt_rev_tl_op mo_sin_rcnt_tl mort_acc mths_since_recent_bc mths_since_recent_bc_dlq mths_since_recent_inq mths_since_recent_revol_delinq num_accts_ever_120_pd num_actv_bc_tl num_actv_rev_tl num_bc_sats num_bc_tl num_il_tl num_op_rev_tl num_rev_accts num_rev_tl_bal_gt_0 num_sats num_tl_120dpd_2m num_tl_30dpd num_tl_90g_dpd_24m num_tl_op_past_12m pct_tl_nvr_dlq percent_bc_gt_75 pub_rec_bankruptcies tax_liens tot_hi_cred_lim total_bal_ex_mort total_bc_limit total_il_high_credit_limit revol_bal_joint sec_app_fico_range_low sec_app_fico_range_high sec_app_earliest_cr_line sec_app_inq_last_6mths sec_app_mort_acc sec_app_open_acc sec_app_revol_util sec_app_open_act_il sec_app_num_rev_accts sec_app_chargeoff_within_12_mths sec_app_collections_12_mths_ex_med sec_app_mths_since_last_major_derog hardship_flag hardship_type 
hardship_reason hardship_status deferral_term hardship_amount hardship_start_date hardship_end_date payment_plan_start_date hardship_length hardship_dpd hardship_loan_status orig_projected_additional_accrued_interest hardship_payoff_balance_amount hardship_last_payment_amount disbursement_method debt_settlement_flag debt_settlement_flag_date settlement_status settlement_date settlement_amount settlement_percentage settlement_term
0 68407277 NaN 3600.0 3600.0 3600.0 36 months 13.99 123.03 C C4 leadman 10+ years MORTGAGE 55000.0 Not Verified Dec-2015 Fully Paid n https://lendingclub.com/browse/loanDetail.acti... NaN debt_consolidation Debt consolidation 190xx PA 5.91 0.0 Aug-2003 675.0 679.0 1.0 30.0 NaN 7.0 0.0 2765.0 29.7 13.0 w 0.00 0.00 4421.723917 4421.72 3600.00 821.72 0.0 0.0 0.0 Jan-2019 122.67 NaN Mar-2019 564.0 560.0 0.0 30.0 1.0 Individual NaN NaN NaN 0.0 722.0 144904.0 2.0 2.0 0.0 1.0 21.0 4981.0 36.0 3.0 3.0 722.0 34.0 9300.0 3.0 1.0 4.0 4.0 20701.0 1506.0 37.2 0.0 0.0 148.0 128.0 3.0 3.0 1.0 4.0 69.0 4.0 69.0 2.0 2.0 4.0 2.0 5.0 3.0 4.0 9.0 4.0 7.0 0.0 0.0 0.0 3.0 76.9 0.0 0.0 0.0 178050.0 7746.0 2400.0 13734.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN N NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN Cash N NaN NaN NaN NaN NaN NaN
1 68355089 NaN 24700.0 24700.0 24700.0 36 months 11.99 820.28 C C1 Engineer 10+ years MORTGAGE 65000.0 Not Verified Dec-2015 Fully Paid n https://lendingclub.com/browse/loanDetail.acti... NaN small_business Business 577xx SD 16.06 1.0 Dec-1999 715.0 719.0 4.0 6.0 NaN 22.0 0.0 21470.0 19.2 38.0 w 0.00 0.00 25679.660000 25679.66 24700.00 979.66 0.0 0.0 0.0 Jun-2016 926.35 NaN Mar-2019 699.0 695.0 0.0 NaN 1.0 Individual NaN NaN NaN 0.0 0.0 204396.0 1.0 1.0 0.0 1.0 19.0 18005.0 73.0 2.0 3.0 6472.0 29.0 111800.0 0.0 0.0 6.0 4.0 9733.0 57830.0 27.1 0.0 0.0 113.0 192.0 2.0 2.0 4.0 2.0 NaN 0.0 6.0 0.0 5.0 5.0 13.0 17.0 6.0 20.0 27.0 5.0 22.0 0.0 0.0 0.0 2.0 97.4 7.7 0.0 0.0 314017.0 39475.0 79300.0 24667.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN N NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN Cash N NaN NaN NaN NaN NaN NaN
2 68341763 NaN 20000.0 20000.0 20000.0 60 months 10.78 432.66 B B4 truck driver 10+ years MORTGAGE 63000.0 Not Verified Dec-2015 Fully Paid n https://lendingclub.com/browse/loanDetail.acti... NaN home_improvement NaN 605xx IL 10.78 0.0 Aug-2000 695.0 699.0 0.0 NaN NaN 6.0 0.0 7869.0 56.2 18.0 w 0.00 0.00 22705.924294 22705.92 20000.00 2705.92 0.0 0.0 0.0 Jun-2017 15813.30 NaN Mar-2019 704.0 700.0 0.0 NaN 1.0 Joint App 71000.0 13.85 Not Verified 0.0 0.0 189699.0 0.0 1.0 0.0 4.0 19.0 10827.0 73.0 0.0 2.0 2081.0 65.0 14000.0 2.0 5.0 1.0 6.0 31617.0 2737.0 55.9 0.0 0.0 125.0 184.0 14.0 14.0 5.0 101.0 NaN 10.0 NaN 0.0 2.0 3.0 2.0 4.0 6.0 4.0 7.0 3.0 6.0 0.0 0.0 0.0 0.0 100.0 50.0 0.0 0.0 218418.0 18696.0 6200.0 14877.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN N NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN Cash N NaN NaN NaN NaN NaN NaN
3 66310712 NaN 35000.0 35000.0 35000.0 60 months 14.85 829.90 C C5 Information Systems Officer 10+ years MORTGAGE 110000.0 Source Verified Dec-2015 Current n https://lendingclub.com/browse/loanDetail.acti... NaN debt_consolidation Debt consolidation 076xx NJ 17.06 0.0 Sep-2008 785.0 789.0 0.0 NaN NaN 13.0 0.0 7802.0 11.6 17.0 w 15897.65 15897.65 31464.010000 31464.01 19102.35 12361.66 0.0 0.0 0.0 Feb-2019 829.90 Apr-2019 Mar-2019 679.0 675.0 0.0 NaN 1.0 Individual NaN NaN NaN 0.0 0.0 301500.0 1.0 1.0 0.0 1.0 23.0 12609.0 70.0 1.0 1.0 6987.0 45.0 67300.0 0.0 1.0 0.0 2.0 23192.0 54962.0 12.1 0.0 0.0 36.0 87.0 2.0 2.0 1.0 2.0 NaN NaN NaN 0.0 4.0 5.0 8.0 10.0 2.0 10.0 13.0 5.0 13.0 0.0 0.0 0.0 1.0 100.0 0.0 0.0 0.0 381215.0 52226.0 62500.0 18000.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN N NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN Cash N NaN NaN NaN NaN NaN NaN
4 68476807 NaN 10400.0 10400.0 10400.0 60 months 22.45 289.91 F F1 Contract Specialist 3 years MORTGAGE 104433.0 Source Verified Dec-2015 Fully Paid n https://lendingclub.com/browse/loanDetail.acti... NaN major_purchase Major purchase 174xx PA 25.37 1.0 Jun-1998 695.0 699.0 3.0 12.0 NaN 12.0 0.0 21929.0 64.5 35.0 w 0.00 0.00 11740.500000 11740.50 10400.00 1340.50 0.0 0.0 0.0 Jul-2016 10128.96 NaN Mar-2018 704.0 700.0 0.0 NaN 1.0 Individual NaN NaN NaN 0.0 0.0 331730.0 1.0 3.0 0.0 3.0 14.0 73839.0 84.0 4.0 7.0 9702.0 78.0 34000.0 2.0 1.0 3.0 10.0 27644.0 4567.0 77.5 0.0 0.0 128.0 210.0 4.0 4.0 6.0 4.0 12.0 1.0 12.0 0.0 4.0 6.0 5.0 9.0 10.0 7.0 19.0 6.0 12.0 0.0 0.0 0.0 4.0 96.6 60.0 0.0 0.0 439570.0 95768.0 20300.0 88097.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN N NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN Cash N NaN NaN NaN NaN NaN NaN
In [7]:
# Display a random sample of 5 rows of the rejected loans.
loan_data_rejected.sample(5)
Out[7]:
Amount Requested Application Date Loan Title Risk_Score Debt-To-Income Ratio Zip Code State Employment Length Policy Code
8910244 15000.0 2014-08-21 debt_consolidation 603.0 32.33% 471xx IN < 1 year 0.0
12365756 25000.0 2017-08-06 Credit card refinancing 622.0 46.58% 713xx LA < 1 year 0.0
18011080 25000.0 2018-05-01 Debt consolidation NaN 54.75% 322xx FL < 1 year 0.0
11160554 15000.0 2018-03-13 Debt consolidation NaN 11.5% 112xx NY < 1 year 0.0
17248026 2500.0 2018-12-31 Car financing NaN 0.54% 453xx OH 1 year 0.0
In [8]:
loan_data_accepted.shape
Out[8]:
(2260701, 151)

Explore Data¶

In [9]:
loan_data = loan_data_accepted.copy()
pd.options.display.max_columns = None
#pd.options.display.max_rows = None
# Sets the pandas dataframe options to display all columns/ rows.
In [10]:
loan_data.columns.values
# Displays all column names.
Out[10]:
array(['id', 'member_id', 'loan_amnt', 'funded_amnt', 'funded_amnt_inv',
       'term', 'int_rate', 'installment', 'grade', 'sub_grade',
       'emp_title', 'emp_length', 'home_ownership', 'annual_inc',
       'verification_status', 'issue_d', 'loan_status', 'pymnt_plan',
       'url', 'desc', 'purpose', 'title', 'zip_code', 'addr_state', 'dti',
       'delinq_2yrs', 'earliest_cr_line', 'fico_range_low',
       'fico_range_high', 'inq_last_6mths', 'mths_since_last_delinq',
       'mths_since_last_record', 'open_acc', 'pub_rec', 'revol_bal',
       'revol_util', 'total_acc', 'initial_list_status', 'out_prncp',
       'out_prncp_inv', 'total_pymnt', 'total_pymnt_inv',
       'total_rec_prncp', 'total_rec_int', 'total_rec_late_fee',
       'recoveries', 'collection_recovery_fee', 'last_pymnt_d',
       'last_pymnt_amnt', 'next_pymnt_d', 'last_credit_pull_d',
       'last_fico_range_high', 'last_fico_range_low',
       'collections_12_mths_ex_med', 'mths_since_last_major_derog',
       'policy_code', 'application_type', 'annual_inc_joint', 'dti_joint',
       'verification_status_joint', 'acc_now_delinq', 'tot_coll_amt',
       'tot_cur_bal', 'open_acc_6m', 'open_act_il', 'open_il_12m',
       'open_il_24m', 'mths_since_rcnt_il', 'total_bal_il', 'il_util',
       'open_rv_12m', 'open_rv_24m', 'max_bal_bc', 'all_util',
       'total_rev_hi_lim', 'inq_fi', 'total_cu_tl', 'inq_last_12m',
       'acc_open_past_24mths', 'avg_cur_bal', 'bc_open_to_buy', 'bc_util',
       'chargeoff_within_12_mths', 'delinq_amnt', 'mo_sin_old_il_acct',
       'mo_sin_old_rev_tl_op', 'mo_sin_rcnt_rev_tl_op', 'mo_sin_rcnt_tl',
       'mort_acc', 'mths_since_recent_bc', 'mths_since_recent_bc_dlq',
       'mths_since_recent_inq', 'mths_since_recent_revol_delinq',
       'num_accts_ever_120_pd', 'num_actv_bc_tl', 'num_actv_rev_tl',
       'num_bc_sats', 'num_bc_tl', 'num_il_tl', 'num_op_rev_tl',
       'num_rev_accts', 'num_rev_tl_bal_gt_0', 'num_sats',
       'num_tl_120dpd_2m', 'num_tl_30dpd', 'num_tl_90g_dpd_24m',
       'num_tl_op_past_12m', 'pct_tl_nvr_dlq', 'percent_bc_gt_75',
       'pub_rec_bankruptcies', 'tax_liens', 'tot_hi_cred_lim',
       'total_bal_ex_mort', 'total_bc_limit',
       'total_il_high_credit_limit', 'revol_bal_joint',
       'sec_app_fico_range_low', 'sec_app_fico_range_high',
       'sec_app_earliest_cr_line', 'sec_app_inq_last_6mths',
       'sec_app_mort_acc', 'sec_app_open_acc', 'sec_app_revol_util',
       'sec_app_open_act_il', 'sec_app_num_rev_accts',
       'sec_app_chargeoff_within_12_mths',
       'sec_app_collections_12_mths_ex_med',
       'sec_app_mths_since_last_major_derog', 'hardship_flag',
       'hardship_type', 'hardship_reason', 'hardship_status',
       'deferral_term', 'hardship_amount', 'hardship_start_date',
       'hardship_end_date', 'payment_plan_start_date', 'hardship_length',
       'hardship_dpd', 'hardship_loan_status',
       'orig_projected_additional_accrued_interest',
       'hardship_payoff_balance_amount', 'hardship_last_payment_amount',
       'disbursement_method', 'debt_settlement_flag',
       'debt_settlement_flag_date', 'settlement_status',
       'settlement_date', 'settlement_amount', 'settlement_percentage',
       'settlement_term'], dtype=object)
In [11]:
loan_data.info()
# Summarizes the dataframe: index range, number of columns, datatypes, and memory usage.
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2260701 entries, 0 to 2260700
Columns: 151 entries, id to settlement_term
dtypes: float64(113), object(38)
memory usage: 2.5+ GB

General Preprocessing¶

Preprocessing few continuous variables¶

Variable 'emp_length'¶

In [12]:
loan_data['emp_length'].unique()
# Displays unique values of a column.
Out[12]:
array(['10+ years', '3 years', '4 years', '6 years', '1 year', '7 years',
       '8 years', '5 years', '2 years', '9 years', '< 1 year', nan],
      dtype=object)
In [13]:
loan_data['emp_length_int'] = loan_data['emp_length'].str.replace('\\+ years', '', regex=True)
loan_data['emp_length_int'] = loan_data['emp_length_int'].str.replace('< 1 year', str(0))
loan_data['emp_length_int'] = loan_data['emp_length_int'].str.replace('n/a',  str(0))
loan_data['emp_length_int'] = loan_data['emp_length_int'].str.replace(' years', '')
loan_data['emp_length_int'] = loan_data['emp_length_int'].str.replace(' year', '')
# We store the preprocessed 'emp_length' variable in a new column called 'emp_length_int':
# first we strip the '+ years' suffix, then we replace the whole string '< 1 year' with '0',
# then we replace 'n/a' with '0', and finally we remove the remaining ' years' and ' year' suffixes.
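For reference, the same cleaning can be done in one pass with a regex extraction. This is an equivalent sketch on a toy Series; note that it maps missing values straight to 0 here for compactness, whereas the notebook mode-imputes them later:

```python
import pandas as pd

s = pd.Series(['10+ years', '3 years', '< 1 year', '1 year', None])

# Pull the leading digits out of each label ('10+ years' -> '10'), force
# '< 1 year' to '0', treat missing values as '0', and convert to int.
emp_len = (s.str.extract(r'(\d+)')[0]
             .where(~s.str.startswith('<', na=False), '0')
             .fillna('0')
             .astype(int))
```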
In [14]:
type(loan_data['emp_length_int'][0])
# Checks the datatype of a single element of a column.
Out[14]:
str
In [15]:
loan_data['emp_length_int'].value_counts()
Out[15]:
emp_length_int
10    748005
2     203677
0     189988
3     180753
1     148403
5     139698
4     136605
6     102628
7      92695
8      91914
9      79395
Name: count, dtype: int64
In [16]:
# Replace "Unknown" with NaN and then convert to numeric
loan_data['emp_length_int'] = loan_data['emp_length_int'].replace('Unknown', np.nan)

# Transforms the values to numeric.
loan_data['emp_length_int'] = pd.to_numeric(loan_data['emp_length_int'])

Fill with the mode (most frequent)¶

In [17]:
loan_data = loan_data.copy()

# Mode imputation
loan_data['emp_length_int'] = loan_data['emp_length_int'].fillna(loan_data['emp_length_int'].mode()[0])
In [18]:
loan_data['emp_length_int'].value_counts()
Out[18]:
emp_length_int
10.0    894945
2.0     203677
0.0     189988
3.0     180753
1.0     148403
5.0     139698
4.0     136605
6.0     102628
7.0      92695
8.0      91914
9.0      79395
Name: count, dtype: int64
In [19]:
# Check the fraction of missing values remaining in 'emp_length_int' (should be zero after imputation).
loan_data['emp_length_int'].isnull().mean()
Out[19]:
0.0

Drop the original 'emp_length' variable¶

In [20]:
loan_data = loan_data.drop(columns = ['emp_length'])

Variable 'term'¶

In [21]:
loan_data['term'].describe()
# Shows some descriptive statistics for the values of a column.
Out[21]:
count        2260668
unique             2
top        36 months
freq         1609754
Name: term, dtype: object
In [22]:
loan_data['term_int'] = loan_data['term'].str.replace(' months', '')
# We replace a string with another string, in this case, with an empty string (i.e. with nothing).
In [23]:
loan_data['term_int'].sample(5)
Out[23]:
1790274     60
1752897     36
1594219     36
175338      60
1387009     60
Name: term_int, dtype: object
In [24]:
type(loan_data['term_int'][25])
# Checks the datatype of a single element of a column.
Out[24]:
str
In [25]:
loan_data['term_int'] = pd.to_numeric(loan_data['term'].str.replace(' months', ''))
# We replace a substring with another string, in this case, with an empty string (i.e. with nothing).
# We turn the result into a numeric datatype and save it in another variable.
loan_data['term_int'].sample(5)
Out[25]:
1682884    36.0
575675     36.0
1187779    36.0
821812     36.0
1579849    36.0
Name: term_int, dtype: float64
In [26]:
type(loan_data['term_int'][0])
# Checks the datatype of a single element of a column.
Out[26]:
numpy.float64

Variable 'issue_d'¶

In [27]:
loan_data['issue_d'].sample(5)
Out[27]:
1662795    Mar-2017
1818823    Jul-2013
270902     May-2015
342562     Mar-2015
1418687    Nov-2018
Name: issue_d, dtype: object
In [28]:
# Assume we are now in December 2020
loan_data['issue_d_date'] = pd.to_datetime(loan_data['issue_d'], format='mixed', errors='coerce')
# Parses the 'Mon-YYYY' date strings into datetimes (unparseable entries become NaT).
loan_data['mths_since_issue_d'] = round(pd.to_numeric((pd.to_datetime('2020-12-01') - loan_data['issue_d_date']) / np.timedelta64(30, 'D')))
# We calculate the difference between two dates in months, turn it to numeric datatype and round it.
# We save the result in a new variable.
loan_data['mths_since_issue_d'].describe()
# Shows some descriptive statistics for the values of a column.
Out[28]:
count    2.260668e+06
mean     5.581604e+01
std      2.189573e+01
min      2.400000e+01
25%      3.800000e+01
50%      5.400000e+01
75%      6.900000e+01
max      1.640000e+02
Name: mths_since_issue_d, dtype: float64
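The 30-day divisor above is only an approximation of a month; if an exact calendar-month difference is preferred, plain year/month arithmetic gives it. A sketch with illustrative dates and the same December 2020 reference:

```python
import pandas as pd

# Exact month difference between a reference month (December 2020, as above)
# and each issue date, via year/month arithmetic.
issue = pd.to_datetime(pd.Series(['Dec-2015', 'Mar-2017', 'Nov-2018']),
                       format='%b-%Y')
ref = pd.Timestamp('2020-12-01')
mths_since = (ref.year - issue.dt.year) * 12 + (ref.month - issue.dt.month)
```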

Check for missing values and clean¶

In [29]:
# Percentage of missing values for each column, sorted from most to least missing.
missing_percent = loan_data.isnull().mean() * 100
missing_percent = missing_percent.sort_values(ascending=False)

# Set option to display all rows
pd.set_option('display.max_rows', None)

# Display the result
print(missing_percent)
member_id                                     100.000000
orig_projected_additional_accrued_interest     99.617331
hardship_loan_status                           99.517097
hardship_last_payment_amount                   99.517097
deferral_term                                  99.517097
hardship_status                                99.517097
hardship_reason                                99.517097
hardship_type                                  99.517097
hardship_end_date                              99.517097
payment_plan_start_date                        99.517097
hardship_length                                99.517097
hardship_dpd                                   99.517097
hardship_amount                                99.517097
hardship_payoff_balance_amount                 99.517097
hardship_start_date                            99.517097
settlement_date                                98.485160
debt_settlement_flag_date                      98.485160
settlement_term                                98.485160
settlement_status                              98.485160
settlement_percentage                          98.485160
settlement_amount                              98.485160
sec_app_mths_since_last_major_derog            98.410139
sec_app_revol_util                             95.303050
revol_bal_joint                                95.221836
sec_app_mort_acc                               95.221792
sec_app_fico_range_low                         95.221792
sec_app_chargeoff_within_12_mths               95.221792
sec_app_collections_12_mths_ex_med             95.221792
sec_app_inq_last_6mths                         95.221792
sec_app_num_rev_accts                          95.221792
sec_app_open_act_il                            95.221792
sec_app_open_acc                               95.221792
sec_app_earliest_cr_line                       95.221792
sec_app_fico_range_high                        95.221792
verification_status_joint                      94.880791
dti_joint                                      94.660683
annual_inc_joint                               94.660506
desc                                           94.423632
mths_since_last_record                         84.113069
mths_since_recent_bc_dlq                       77.011511
mths_since_last_major_derog                    74.309960
mths_since_recent_revol_delinq                 67.250910
next_pymnt_d                                   59.509993
mths_since_last_delinq                         51.246715
il_util                                        47.281042
mths_since_rcnt_il                             40.251099
all_util                                       38.323555
total_cu_tl                                    38.313912
open_acc_6m                                    38.313912
inq_last_12m                                   38.313912
open_rv_12m                                    38.313868
open_rv_24m                                    38.313868
total_bal_il                                   38.313868
max_bal_bc                                     38.313868
open_il_24m                                    38.313868
open_il_12m                                    38.313868
inq_fi                                         38.313868
open_act_il                                    38.313868
mths_since_recent_inq                          13.069751
emp_title                                       7.387178
num_tl_120dpd_2m                                6.798334
mo_sin_old_il_acct                              6.153136
bc_util                                         3.366389
percent_bc_gt_75                                3.335779
bc_open_to_buy                                  3.316140
mths_since_recent_bc                            3.248771
pct_tl_nvr_dlq                                  3.116909
avg_cur_bal                                     3.113149
mo_sin_rcnt_rev_tl_op                           3.110097
num_rev_accts                                   3.110097
mo_sin_old_rev_tl_op                            3.110097
num_actv_bc_tl                                  3.110053
mo_sin_rcnt_tl                                  3.110053
num_actv_rev_tl                                 3.110053
num_accts_ever_120_pd                           3.110053
total_il_high_credit_limit                      3.110053
num_il_tl                                       3.110053
num_bc_tl                                       3.110053
total_rev_hi_lim                                3.110053
tot_hi_cred_lim                                 3.110053
num_tl_op_past_12m                              3.110053
num_tl_90g_dpd_24m                              3.110053
num_tl_30dpd                                    3.110053
num_rev_tl_bal_gt_0                             3.110053
num_op_rev_tl                                   3.110053
tot_cur_bal                                     3.110053
tot_coll_amt                                    3.110053
num_bc_sats                                     2.593134
num_sats                                        2.593134
total_bc_limit                                  2.214490
total_bal_ex_mort                               2.214490
acc_open_past_24mths                            2.214490
mort_acc                                        2.214490
title                                           1.033264
last_pymnt_d                                    0.108816
revol_util                                      0.081170
dti                                             0.077144
pub_rec_bankruptcies                            0.061839
collections_12_mths_ex_med                      0.007874
chargeoff_within_12_mths                        0.007874
tax_liens                                       0.006104
last_credit_pull_d                              0.004645
inq_last_6mths                                  0.002787
earliest_cr_line                                0.002743
pub_rec                                         0.002743
open_acc                                        0.002743
delinq_2yrs                                     0.002743
acc_now_delinq                                  0.002743
delinq_amnt                                     0.002743
total_acc                                       0.002743
annual_inc                                      0.001637
zip_code                                        0.001504
disbursement_method                             0.001460
hardship_flag                                   0.001460
debt_settlement_flag                            0.001460
term_int                                        0.001460
issue_d_date                                    0.001460
mths_since_issue_d                              0.001460
application_type                                0.001460
sub_grade                                       0.001460
url                                             0.001460
pymnt_plan                                      0.001460
loan_status                                     0.001460
issue_d                                         0.001460
verification_status                             0.001460
home_ownership                                  0.001460
grade                                           0.001460
policy_code                                     0.001460
installment                                     0.001460
int_rate                                        0.001460
term                                            0.001460
funded_amnt_inv                                 0.001460
funded_amnt                                     0.001460
loan_amnt                                       0.001460
purpose                                         0.001460
addr_state                                      0.001460
fico_range_low                                  0.001460
fico_range_high                                 0.001460
last_fico_range_low                             0.001460
last_fico_range_high                            0.001460
last_pymnt_amnt                                 0.001460
collection_recovery_fee                         0.001460
recoveries                                      0.001460
total_rec_late_fee                              0.001460
total_rec_int                                   0.001460
total_rec_prncp                                 0.001460
total_pymnt_inv                                 0.001460
total_pymnt                                     0.001460
out_prncp_inv                                   0.001460
out_prncp                                       0.001460
initial_list_status                             0.001460
revol_bal                                       0.001460
emp_length_int                                  0.000000
id                                              0.000000
dtype: float64

The following features with missing values > 90% can be dropped:¶

  • member_id: Completely missing (100%) — useless for modeling
  • orig_projected_additional_accrued_interest: Nearly all missing (~99.6%), irrelevant to creditworthiness
  • hardship_... features: ~99.5% missing, very sparse; relevant only to specialized hardship analysis
  • settlement_... features: ~98.5% missing, specific to post-default negotiation — not useful for default prediction before loan approval
  • sec_app_... features: ~95% missing, refer to secondary applicants — only useful for joint applications, which are rare
  • revol_bal_joint: ~95.2% missing, same reason as above
  • verification_status_joint, dti_joint, annual_inc_joint: ~94.6–94.8% missing, tied to joint applications — can be dropped for general credit risk modeling
  • desc (94.4%): very sparse and unstructured free text; not useful unless we plan to do NLP

Why it makes sense to drop them:

In credit risk modeling, especially when we are focused on individual (non-joint) loans, these features:

  • Won’t contribute meaningfully to predictive power
  • Might introduce noise or overfitting due to their sparsity
  • Can increase memory and computation time unnecessarily
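The same drop can also be expressed programmatically with a missingness threshold, which avoids maintaining a hand-written column list. A sketch using the 90% cutoff mentioned above (`drop_sparse` is a hypothetical helper, shown here on a toy frame):

```python
import numpy as np
import pandas as pd

def drop_sparse(df, threshold=0.9):
    # Keep only the columns whose share of missing values is below the threshold.
    return df.loc[:, df.isnull().mean() < threshold]

# Toy frame: 'mostly_nan' is 95% missing and gets dropped.
demo = pd.DataFrame({'ok': range(100),
                     'mostly_nan': [1.0] * 5 + [np.nan] * 95})
cleaned = drop_sparse(demo)
```

The explicit list used in the next cell has the advantage of being auditable feature by feature; the threshold version is more convenient when the schema changes.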
In [30]:
features_to_drop = [
    'member_id', 'orig_projected_additional_accrued_interest',
    'hardship_end_date', 'deferral_term', 'hardship_status',
    'hardship_reason', 'hardship_type', 'hardship_payoff_balance_amount',
    'hardship_last_payment_amount', 'payment_plan_start_date',
    'hardship_amount', 'hardship_loan_status', 'hardship_start_date',
    'hardship_dpd', 'hardship_length', 'debt_settlement_flag_date',
    'settlement_date', 'settlement_amount', 'settlement_percentage',
    'settlement_term', 'settlement_status', 'sec_app_mths_since_last_major_derog',
    'sec_app_revol_util', 'revol_bal_joint', 'sec_app_inq_last_6mths',
    'sec_app_num_rev_accts', 'sec_app_open_act_il', 'sec_app_open_acc',
    'sec_app_mort_acc', 'sec_app_chargeoff_within_12_mths',
    'sec_app_collections_12_mths_ex_med', 'sec_app_fico_range_low',
    'sec_app_earliest_cr_line', 'sec_app_fico_range_high',
    'verification_status_joint', 'dti_joint', 'annual_inc_joint', 'desc'
]

# Drop the features with very high missing values (>90%)
loan_data = loan_data.drop(columns=features_to_drop)

Considering the following three features:¶

  • mths_since_last_record (84.1%): Credit delay history — could be useful, but very sparse
  • mths_since_recent_bc_dlq (77.0%): Credit delay (bank card delinquency) — consider keeping if strongly predictive
  • mths_since_last_major_derog (74.3%): Major derogatory marks — credit-relevant, but high sparsity

Recommendation:

  • Run a correlation or feature importance test (like Random Forest feature importance).
  • If any of them has low predictive power, it will be dropped.
  • If we prioritize simplicity and generalizability, it is also reasonable to drop all three.

We evaluate the predictive power of the three delinquency-related features using a Random Forest classifier and its feature importances.

In [31]:
# Step 1: Choose the relevant features and the target
delay_features = [
    'mths_since_last_record',
    'mths_since_recent_bc_dlq',
    'mths_since_last_major_derog'
]

# Example: Binary target variable preparation (adjust based on your dataset)
# Replace with appropriate mapping depending on your version of 'loan_status'
loan_data['target'] = loan_data['loan_status'].apply(lambda x: 1 if x in ['Charged Off', 'Default'] else 0)

# Step 2: Subset the data and drop rows with missing values in selected columns
df = loan_data[delay_features + ['target']].dropna()

# Step 3: Split into train/test
X = df[delay_features]
y = df['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Step 4: Train a Random Forest model
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Step 5: Feature importance visualization
importances = model.feature_importances_
feat_importance = pd.Series(importances, index=delay_features).sort_values(ascending=True)

print(importances)
plt.figure(figsize=(8, 4))
sns.barplot(x=feat_importance.values, y=feat_importance.index,
            hue=feat_importance.index, palette='viridis', legend=False)
plt.title("Feature Importance (Random Forest)")
plt.xlabel("Importance Score")
plt.ylabel("Delay Feature")
plt.tight_layout()
plt.show()
[0.40604894 0.29841302 0.29553803]
[Figure: horizontal bar chart of the Random Forest feature importances]

Interpretation:¶

  • mths_since_last_record contributes most to predicting credit risk among the three, capturing about 40.6% of the predictive power in this mini-model.
  • The other two features are almost equally important and still carry decent predictive power (~29.8% and ~29.6%).

Since all three features have reasonable importance, we keep all of them: we are optimizing for model performance and can tolerate some sparsity from missing values (or plan to impute them).

Fill Missing values of these 3 features:¶

These columns represent the number of months since a certain delinquency event, so missing values usually mean "the event never occurred".

  • Best Practice: Impute with a large number (e.g., 999) to indicate “never had this event”.
  • This maintains the numeric nature of the variable and distinguishes between recent vs never.
In [32]:
# List of the features
delay_features = ['mths_since_last_record', 'mths_since_recent_bc_dlq', 'mths_since_last_major_derog']

# Fill missing values with 999
loan_data[delay_features] = loan_data[delay_features].fillna(999)

Features with moderate missingness (around 38–67%)¶

For these features, the imputation strategy depends on the data type and business logic of each feature. Since our project focuses on credit risk modeling, we’ll treat these with care.

  • Date-related / "Months since": Fill with a high number (e.g., 999) to indicate "No activity" or "Never delinquent".
  • Count-type features: Fill with 0 (meaning "no record", e.g., 0 inquiries).
  • Ratio/Utilization (%): Fill with median or domain-specific value (e.g., 0 or 100%).

Recommended Fill Strategies:

In [33]:
# Fill 'months since' features with high value to indicate 'no delinquency'
loan_data['mths_since_recent_revol_delinq'] = loan_data['mths_since_recent_revol_delinq'].fillna(999)
loan_data['mths_since_last_delinq'] = loan_data['mths_since_last_delinq'].fillna(999)
loan_data['mths_since_rcnt_il'] = loan_data['mths_since_rcnt_il'].fillna(999)

# Date - next scheduled payment, could be missing because loan is fully paid off
loan_data['next_pymnt_d'] = loan_data['next_pymnt_d'].fillna('No Payment Due')  # or use pd.NaT if you prefer datetime

# Utilization ratios - fill with median or 0
loan_data['il_util'] = loan_data['il_util'].fillna(loan_data['il_util'].median())
loan_data['all_util'] = loan_data['all_util'].fillna(loan_data['all_util'].median())

# Counts / Frequency - fill with 0 (no activity)
count_cols = [
    'total_cu_tl', 'open_acc_6m', 'inq_last_12m', 'total_bal_il', 'max_bal_bc',
    'open_il_12m', 'open_act_il', 'inq_fi', 'open_rv_12m', 'open_rv_24m', 'open_il_24m'
]
loan_data[count_cols] = loan_data[count_cols].fillna(0)

Features with missing values ranging from ~1% to ~13%¶

For features with missing values in the ~1% to ~13% range, here is a tailored imputation strategy that balances practicality, model performance, and data integrity for credit risk modeling.

Grouping by Feature Type and Imputation Strategy:

  • Delinquency / Inquiries / Behavior:

    • mths_since_recent_inq: fill with 999; indicates no inquiries (consistent with the other "mths_since" logic).
    • num_tl_120dpd_2m: fill with 0; no delinquency.
  • Account age / timelines:

    • mo_sin_old_il_acct, mo_sin_old_rev_tl_op, mo_sin_rcnt_rev_tl_op, mo_sin_rcnt_tl: fill with the median; age-based numeric values.
  • Utilization / ratio:

    • bc_util, percent_bc_gt_75, pct_tl_nvr_dlq, all_util: fill with the median or a domain-specific value (e.g., 0); percentage values.
  • Balances / credit limits:

    • bc_open_to_buy, avg_cur_bal, total_rev_hi_lim, tot_cur_bal, total_il_high_credit_limit, tot_hi_cred_lim, total_bc_limit, total_bal_ex_mort: fill with the median; dollar amounts, where the median avoids skew.
  • Count features (accounts, inquiries): fill with 0; no accounts or events (a safe assumption).

    • Features: num_rev_accts, num_accts_ever_120_pd, num_actv_bc_tl, num_actv_rev_tl, num_rev_tl_bal_gt_0, num_tl_90g_dpd_24m, num_tl_30dpd, num_tl_op_past_12m, num_op_rev_tl, num_il_tl, num_bc_tl, num_bc_sats, num_sats, mort_acc, acc_open_past_24mths, tot_coll_amt
  • Loan metadata (text): title; fill with "Unknown" or drop; optional unstructured text.

  • mths_since_recent_bc (3.25% missing):

    • Meaning: number of months since the borrower's most recent bankcard account was opened.
    • Type: numeric, continuous (likely integer).
    • Strategy: median imputation, which is robust to outliers and appropriate for time-based features.
  • Employment-related:
    • emp_title: fill with "Unknown" or "Other"; free text with no standard format. Optionally drop or encode later.
In [34]:
loan_data = loan_data.drop(columns=['emp_title'])
In [37]:
# Months since events
loan_data['mths_since_recent_inq'] = loan_data['mths_since_recent_inq'].fillna(999)

# Utilization/ratio: median
ratio_cols = ['bc_util', 'percent_bc_gt_75', 'pct_tl_nvr_dlq']
loan_data[ratio_cols] = loan_data[ratio_cols].fillna(loan_data[ratio_cols].median())

# Timeline features: median
timeline_cols = ['mo_sin_old_il_acct', 'mo_sin_old_rev_tl_op', 'mo_sin_rcnt_rev_tl_op', 'mo_sin_rcnt_tl']
loan_data[timeline_cols] = loan_data[timeline_cols].fillna(loan_data[timeline_cols].median())

# Balances and limits: median
balance_cols = [
    'bc_open_to_buy', 'avg_cur_bal', 'total_rev_hi_lim', 'tot_cur_bal',
    'total_il_high_credit_limit', 'tot_hi_cred_lim', 'total_bc_limit', 'total_bal_ex_mort'
]
loan_data[balance_cols] = loan_data[balance_cols].fillna(loan_data[balance_cols].median())

# Count features: fill with 0
count_cols = [
    'num_rev_accts', 'num_accts_ever_120_pd', 'num_actv_bc_tl', 'num_actv_rev_tl',
    'num_rev_tl_bal_gt_0', 'num_tl_90g_dpd_24m', 'num_tl_30dpd', 'num_tl_op_past_12m',
    'num_op_rev_tl', 'num_il_tl', 'num_bc_tl', 'num_bc_sats', 'num_sats',
    'mort_acc', 'acc_open_past_24mths', 'tot_coll_amt', 'num_tl_120dpd_2m'
]
loan_data[count_cols] = loan_data[count_cols].fillna(0)

# Title (text field)
loan_data['title'] = loan_data['title'].fillna("Unknown")

# mths_since_recent_bc
loan_data['mths_since_recent_bc'] = loan_data['mths_since_recent_bc'].fillna(loan_data['mths_since_recent_bc'].median())

Variable 'earliest_cr_line'¶

In [38]:
loan_data['earliest_cr_line_date'] = pd.to_datetime(loan_data['earliest_cr_line'], format='mixed', errors='coerce')
# Parses the string dates into datetime; format='mixed' infers each entry's format, and errors='coerce' turns unparseable values into NaT.
In [39]:
type(loan_data['earliest_cr_line_date'][0])
# Checks the datatype of a single element of a column.
Out[39]:
pandas._libs.tslibs.timestamps.Timestamp
In [40]:
# Assume we are now in December 2020
loan_data['mths_since_earliest_cr_line'] = round(pd.to_numeric((pd.to_datetime('2020-12-01') - loan_data['earliest_cr_line_date'])/np.timedelta64(30, 'D')))
# We calculate the difference between two dates in months, turn it to numeric datatype and round it.
# We save the result in a new variable.
In [41]:
loan_data['mths_since_earliest_cr_line'].describe()
# Shows some descriptive statistics for the values of a column.
# Two-digit years can be mis-parsed into the future (e.g., 'Aug-69' read as 2069), which would
# produce negative month differences; the minimum below (62 months) shows that did not occur here.
Out[41]:
count    2.260701e+06
mean     2.553759e+02
std      9.554257e+01
min      6.200000e+01
25%      1.900000e+02
50%      2.390000e+02
75%      3.050000e+02
max      1.068000e+03
Name: mths_since_earliest_cr_line, dtype: float64
In [42]:
loan_data['earliest_cr_line_date'].sample(5)
Out[42]:
405485    2000-01-01
1230975   1995-12-01
586343    2006-08-01
290012    2006-02-01
865494    1998-02-01
Name: earliest_cr_line_date, dtype: datetime64[ns]
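If any dates had been mis-parsed into the future (producing negative month differences), a common defensive fix is to replace the negatives with the column maximum, treating them as very old credit lines. A standalone sketch on a made-up mini-series (the values are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical values standing in for loan_data['mths_since_earliest_cr_line'],
# two of which are negative as if their dates had been parsed into the future.
mths = pd.Series([321.0, -45.0, 204.0, -12.0, 478.0],
                 name='mths_since_earliest_cr_line')

# Negative differences can only come from mis-parsed dates, so we assign
# those rows the column maximum ("as old as the oldest observed credit line").
mths = np.where(mths < 0, mths.max(), mths)
print(mths)  # the two negatives become 478.0
```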

For the following features, the missing rates are all very low (under ~0.1% for most), so we can confidently impute them using simple and fast methods without significant risk of bias or information loss. Here's a breakdown with recommended strategies:¶

  • Date-related fields: last_pymnt_d, last_credit_pull_d, earliest_cr_line_date, earliest_cr_line; use the mode or most recent date (dates are usually month/year; with such a low missing rate, filling with the most frequent value is safe).

  • Ratios / percentages: revol_util, dti; use the median (continuous; the median is robust to outliers).

  • Credit history indicators: pub_rec_bankruptcies, collections_12_mths_ex_med, chargeoff_within_12_mths, tax_liens, pub_rec, delinq_2yrs, delinq_amnt, acc_now_delinq; fill with 0 (missing likely means "none", a very common assumption in credit data).

  • Credit activity counts: open_acc, total_acc, inq_last_6mths, mths_since_earliest_cr_line; use the median or 0 (continuous, stable values).

  • Income: annual_inc; use the median (rarely missing; the median is better than the mean due to skew).

  • Zip code: zip_code; use the mode (categorical; the most common zip is fine).

In [43]:
# 1. Date columns - fill with most frequent date or a placeholder (e.g., 'Jan-2019')
date_cols = ['last_pymnt_d', 'last_credit_pull_d', 'earliest_cr_line']
for col in date_cols:
    most_common_date = loan_data[col].mode()[0]
    loan_data[col] = loan_data[col].fillna(most_common_date)

# 2. Ratios: Fill with median
loan_data['revol_util'] = loan_data['revol_util'].fillna(loan_data['revol_util'].median())
loan_data['dti'] = loan_data['dti'].fillna(loan_data['dti'].median())

# 3. Credit public records: Assume 0 means no record
zero_fill_cols = [
    'pub_rec_bankruptcies', 'chargeoff_within_12_mths', 'collections_12_mths_ex_med',
    'tax_liens', 'pub_rec', 'delinq_2yrs', 'delinq_amnt', 'acc_now_delinq'
]
loan_data[zero_fill_cols] = loan_data[zero_fill_cols].fillna(0)

# 4. Credit counts: fill with median
count_cols = [
    'open_acc', 'total_acc', 'inq_last_6mths', 'mths_since_earliest_cr_line'
]
loan_data[count_cols] = loan_data[count_cols].fillna(loan_data[count_cols].median())

# 5. Annual income
loan_data['annual_inc'] = loan_data['annual_inc'].fillna(loan_data['annual_inc'].median())

# 6. Zip code
loan_data['zip_code'] = loan_data['zip_code'].fillna(loan_data['zip_code'].mode()[0])

Variables 'last_pymnt_d' & 'last_credit_pull_d'¶

In [44]:
# 'last_credit_pull_d' and 'last_pymnt_d' are date features that give temporal insight into borrower behavior and loan servicing
# Creating time-based features like “months since last payment” or “months since last credit pull” can really help the credit risk model
# understand borrower behavior better. 

# Convert date columns to datetime if they aren't already
loan_data['last_pymnt_d'] = pd.to_datetime(loan_data['last_pymnt_d'], format='mixed', errors='coerce')
loan_data['last_credit_pull_d'] = pd.to_datetime(loan_data['last_credit_pull_d'], format='mixed', errors='coerce')

# Reference date — can use today or a fixed date (e.g., end of data collection period)
reference_date = pd.to_datetime("2020-12-31")  # Replace with appropriate date based on your dataset

# Create new features: months since last payment and last credit pull
loan_data['months_since_last_pymnt'] = round(pd.to_numeric((reference_date - loan_data['last_pymnt_d']) / np.timedelta64(30, 'D')))
loan_data['months_since_last_credit_pull'] = round(pd.to_numeric((reference_date - loan_data['last_credit_pull_d']) / np.timedelta64(30, 'D')))

For the following features, the missing rate is extremely low (~0.0015%), so imputation is safe and won't significantly affect our model. Here’s a breakdown and recommended strategies:¶

  • Categorical flags / IDs: disbursement_method, hardship_flag, pymnt_plan, application_type, verification_status, home_ownership, initial_list_status, term, purpose, addr_state, sub_grade, grade, policy_code; use the mode (categorical; fill with the most frequent value).

  • Interest & loan terms: int_rate, installment, term_int, loan_amnt, funded_amnt, funded_amnt_inv; use the median (numeric and continuous; the median is robust to outliers).

  • Dates: issue_d, issue_d_date, mths_since_issue_d; use the most frequent date, or recalculate from other columns (ensure consistency; recompute if redundant).

  • FICO scores: fico_range_low, fico_range_high, last_fico_range_low, last_fico_range_high; use the median (continuous).

  • Payment-related: last_pymnt_amnt, collection_recovery_fee, recoveries, total_rec_late_fee, total_rec_int, total_rec_prncp, total_pymnt_inv, total_pymnt, out_prncp_inv, out_prncp, revol_bal; use the median or 0 (if monetary, the median; if missing can mean "no payment", 0).

  • Unnecessary / deprecated: url; drop (not useful for modeling, as each loan has a unique URL).

  • Debt settlement: debt_settlement_flag; use the mode (often 'N' or 'None'; treat as categorical).

In [45]:
# Categorical columns to fill with mode
cat_cols = [
    'disbursement_method', 'hardship_flag', 'pymnt_plan', 'application_type', 'verification_status',
    'home_ownership', 'initial_list_status', 'term', 'purpose', 'addr_state',
    'sub_grade', 'grade', 'policy_code', 'debt_settlement_flag'
]
for col in cat_cols:
    loan_data[col] = loan_data[col].fillna(loan_data[col].mode()[0])

# Numeric columns to fill with median
num_cols = [
    'int_rate', 'installment', 'term_int', 'loan_amnt', 'funded_amnt', 'funded_amnt_inv',
    'fico_range_low', 'fico_range_high', 'last_fico_range_low', 'last_fico_range_high',
    'last_pymnt_amnt', 'collection_recovery_fee', 'recoveries', 'total_rec_late_fee',
    'total_rec_int', 'total_rec_prncp', 'total_pymnt_inv', 'total_pymnt', 'out_prncp_inv',
    'out_prncp', 'revol_bal'
]
loan_data[num_cols] = loan_data[num_cols].fillna(loan_data[num_cols].median())

# Date columns - fill with mode
date_cols = ['issue_d', 'issue_d_date', 'mths_since_issue_d']
for col in date_cols:
    loan_data[col] = loan_data[col].fillna(loan_data[col].mode()[0])

# Drop 'url' if not used
loan_data = loan_data.drop(columns=['url'])

loan_status (0.0015% missing)¶

  • Meaning: Current status of the loan (e.g., 'Fully Paid', 'Charged Off', 'Current', etc.).

  • Type: Categorical — often the target variable in credit risk models!

  • Strategy:

    • ✅ If you are modeling loan status as the target, you should drop those rows (since we don’t want to guess the target).
    • 🚫 Avoid filling with mode unless you're doing something like survival analysis or lifetime modeling.
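A minimal sketch of that choice: rows with a missing target are dropped rather than imputed (a small hypothetical frame stands in for loan_data here):

```python
import numpy as np
import pandas as pd

# Hypothetical mini-frame standing in for loan_data.
df = pd.DataFrame({
    'loan_amnt': [5000, 8000, 12000],
    'loan_status': ['Fully Paid', np.nan, 'Charged Off'],
})

# Never impute the target variable: rows with an unknown outcome are removed.
df = df.dropna(subset=['loan_status'])
print(len(df))  # 2 rows remain
```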

Verify if there are remaining missing values¶

In [46]:
# Checking the percentage of the missing values for each category
missing_percent = loan_data.isnull().mean() * 100
missing_percent = missing_percent[missing_percent > 0].sort_values(ascending=False)

# Display the result
print(missing_percent)
loan_status    0.00146
dtype: float64

Almost all missing values are now handled; only 'loan_status' retains a negligible 0.0015% missing, and those rows are removed later when the target variable is defined.

In [47]:
# Display a sample of 5 entries.
loan_data.sample(5)
Out[47]:
[Output: a 5-row random sample across all columns, including id, loan_amnt, int_rate, grade, loan_status, the FICO ranges, the imputed delinquency features, the temporary 'target', and the engineered date-based columns]
In [48]:
# Drop the column 'target'
loan_data = loan_data.drop(columns=['target'])
In [49]:
loan_data.shape
Out[49]:
(2260701, 118)

The final version of the dataset is composed of 2,260,701 rows and 118 columns.¶

II. PD model (Probability of Default)¶

Data preparation¶

Dependent Variable — key target variable for credit risk modeling¶

The 'loan_status' feature is a key target variable for credit risk modeling.

In [50]:
loan_data['loan_status'].unique()
# Displays unique values of 'loan_status' column.
Out[50]:
array(['Fully Paid', 'Current', 'Charged Off', 'In Grace Period',
       'Late (31-120 days)', 'Late (16-30 days)', 'Default', nan,
       'Does not meet the credit policy. Status:Fully Paid',
       'Does not meet the credit policy. Status:Charged Off'],
      dtype=object)
  • Fully Paid: The borrower repaid the loan in full. ✅ Good outcome.
  • Current: The borrower is still making payments and is on schedule. Ongoing loan.
  • Charged Off: The lender has written off the loan as a loss after severe delinquency. ❌ Bad outcome.
  • Late (31-120 days): Payments are overdue by 31 to 120 days. High-risk. May end in default or charge-off.
  • In Grace Period: Recently missed payment, but within an acceptable grace period (usually <15 days). Moderate risk.
  • Late (16-30 days): Slightly overdue, may still recover. Warning sign.
  • Does not meet the credit policy. Status:Fully Paid: Was funded outside of normal policy but fully paid. You can treat like “Fully Paid”.
  • Does not meet the credit policy. Status:Charged Off: Outside policy and ended in loss. You can treat like “Charged Off”.
  • Default: Officially declared as defaulted. Worst-case scenario. May overlap with Charged Off.
In [54]:
loan_data['loan_status'].value_counts()
# Calculates the number of observations for each unique value of a variable.
Out[54]:
loan_status
Fully Paid                                             1076751
Current                                                 878317
Charged Off                                             268559
Late (31-120 days)                                       21467
In Grace Period                                           8436
Late (16-30 days)                                         4349
Does not meet the credit policy. Status:Fully Paid        1988
Does not meet the credit policy. Status:Charged Off        761
Default                                                     40
Name: count, dtype: int64
In [55]:
loan_data['loan_status'].value_counts() / loan_data['loan_status'].count() *100
# We divide the number of observations for each unique value of a variable by the total number of observations.
# Thus, we get the proportion of observations for each unique value of a variable.
Out[55]:
loan_status
Fully Paid                                             47.629771
Current                                                38.852100
Charged Off                                            11.879630
Late (31-120 days)                                      0.949587
In Grace Period                                         0.373164
Late (16-30 days)                                       0.192377
Does not meet the credit policy. Status:Fully Paid      0.087939
Does not meet the credit policy. Status:Charged Off     0.033663
Default                                                 0.001769
Name: count, dtype: float64

Typical Modeling Strategy: Grouping Loan Status¶

To build a binary credit risk model (e.g., Will the borrower default or not?), we will group into "Good" vs. "Bad" loans:

✅ Good:

  • Fully Paid
  • Does not meet the credit policy. Status:Fully Paid

❌ Bad:

  • Charged Off
  • Default
  • Late (31-120 days)
  • Late (16-30 days)
  • Does not meet the credit policy. Status:Charged Off

⚠️ Special Cases:

  • Current: The borrower is up to date on payments. However, the loan has not reached maturity — so we don’t yet know if it will default or be fully paid.
  • In Grace Period: The borrower has missed a payment, but is still within the lender’s allowed grace period (typically 15 days). This could still go either way — recovery or default.

✅ Best Practice for Credit Risk Modeling: Treat “Current” and “In Grace Period” as unknown cases and exclude them from model training. Use them later only for prediction/evaluation if needed.

✅ Treating Them as Unknown = Conservative, Trustworthy, and Realistic:

  • These loans haven’t finished their life cycle. Some will default later, others will be fully paid — you just don’t know yet.
  • Including them during training will create label noise and weaken your model's ability to differentiate true risk signals.

📈 Clean Binary Classification = Better Interpretability:

  • You can clearly define:
    • Good (0): Fully Paid
    • Bad (1): Charged Off, Default, and potentially Late
  • Train a robust binary classifier, then apply it to Current loans as future predictions.

For the two statuses:¶

  • 'Does not meet the credit policy. Status:Fully Paid'
  • 'Does not meet the credit policy. Status:Charged Off'

These are special cases flagged by Lending Club: They indicate that the loan didn’t meet Lending Club’s internal credit policy at the time of application, but was still funded (usually manually by investors or for internal testing).

However, they do have known final outcomes:

  • Some ended up Fully Paid.
  • Others ended up Charged Off.

These cases may not be representative of the standard population:

  • Bypassed the normal screening process.
  • Could be riskier or manually approved based on different criteria.
  • Might bias the model slightly if not handled carefully.

Recommended Options:

  • Exclude them for maximum model purity.
  • The model will then reflect only the standard Lending Club loan-approval logic.
In [56]:
# Step 1: Keep only loans with known outcomes
loan_data_clean = loan_data[loan_data['loan_status'].isin(['Fully Paid','Charged Off','Default',
                                                           'Late (31-120 days)','Late (16-30 days)'])]
In [57]:
# copy the datafile
loan_data_clean = loan_data_clean.copy()

# Good/ Bad Definition
loan_data_clean['good_bad'] = np.where(
    loan_data_clean['loan_status'].isin(['Charged Off', 'Default', 
                                         'Late (31-120 days)', 'Late (16-30 days)']), 1, 0)
# We create a new variable that takes the value '1' if the loan status is bad (the condition is met), and '0' otherwise.
In [58]:
# shape of the final cleaned dataset
loan_data_clean.shape
Out[58]:
(1371166, 119)

The cleaned dataset that will be used for training the model is composed of 1,371,166 rows and 119 columns (the original features plus the new 'good_bad' target).

Splitting Data¶

In [59]:
from sklearn.model_selection import train_test_split
# Imports the libraries we need.
In [60]:
loan_data_inputs_train, loan_data_inputs_test, loan_data_targets_train, loan_data_targets_test = train_test_split(
              loan_data_clean.drop('good_bad', axis = 1), loan_data_clean['good_bad'], test_size = 0.2, random_state = 42)
# We split two dataframes with inputs and targets, each into a train and test dataframe, and store them in variables.
# This time we set the size of the test dataset to be 20%.
# Respectively, the size of the train dataset becomes 80%.
# We also set a specific random state.
# This would allow us to perform the exact same split multiple times.
# This means, to assign the exact same observations to the train and test datasets.
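The split above is not stratified; with roughly 20% bad loans, a variant worth considering (not used here) is a stratified split, which preserves the good/bad ratio in both partitions. A sketch on made-up toy data:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy inputs: 80 good (0) and 20 bad (1) observations.
X = pd.DataFrame({'x': range(100)})
y = pd.Series([0] * 80 + [1] * 20, name='good_bad')

# stratify=y keeps the 20% bad rate in both the train and test partitions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

print(y_train.mean(), y_test.mean())  # 0.2 0.2
```

This matters most when the minority class is small: an unstratified split can leave the test set with too few bad loans to evaluate the model reliably.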
In [61]:
loan_data_inputs_train.shape
# Displays the size of the dataframe.
Out[61]:
(1096932, 118)
In [62]:
loan_data_targets_train.shape
# Displays the size of the dataframe.
Out[62]:
(1096932,)
In [63]:
loan_data_inputs_test.shape
# Displays the size of the dataframe.
Out[63]:
(274234, 118)
In [64]:
loan_data_targets_test.shape
# Displays the size of the dataframe.
Out[64]:
(274234,)

Save inputs data & targets data for training & testing¶

In [1130]:
#####
# Run the preprocessing first with the train set assigned, then re-run the same steps
# with the test set by switching which pair of assignments is commented out.
#df_inputs_prepr = loan_data_inputs_train
#df_targets_prepr = loan_data_targets_train
#####
df_inputs_prepr = loan_data_inputs_test
df_targets_prepr = loan_data_targets_test

A. Preprocessing Discrete Variables¶

The Weight of Evidence (WoE) is a powerful technique, especially in credit scoring and risk modeling, for transforming categorical or discrete variables into a numerical format that’s both predictive and interpretable.

What is Weight of Evidence (WoE)?¶

Weight of Evidence transforms categorical or binned continuous variables into a numeric scale that measures how strongly a variable predicts the target (usually binary: good vs. bad loan).

It is widely used in credit scoring because:

  • It helps handle categorical variables with many levels.
  • It encourages a monotonic relationship with the target variable.
  • It works well with logistic regression models.

How to Use WoE with Discrete (Categorical) Variables?¶

  • Group the variable’s categories (or bin if continuous).
  • Count the number of goods and bads in each group.
  • Calculate WoE for each group as the natural log of the ratio between the good and bad distributions.
  • Replace each category in the original variable with its WoE value.

Why Use WoE?¶

  • Ensures variables have a predictive relationship with the target.
  • Useful for interpretable models like scorecards or logistic regression.
  • Helps identify information value (IV) — a metric to judge a variable’s predictive power.

Bonus: Use WoE Together with Information Value (IV)¶

  • IV < 0.02 → Not predictive.
  • 0.02–0.1 → Weak.
  • 0.1–0.3 → Medium.
  • 0.3–0.5 → Strong.
  • > 0.5 → Suspiciously powerful (check for data leakage).
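The steps above can be sketched on a toy variable (the counts below are hypothetical, not taken from the loan data):

```python
import numpy as np
import pandas as pd

# Hypothetical per-category counts of good and bad loans.
df = pd.DataFrame({'category': ['A', 'B', 'C'],
                   'n_good': [900, 500, 100],
                   'n_bad':  [100, 250, 150]})

# Distribution of goods and bads across the categories.
df['prop_n_good'] = df['n_good'] / df['n_good'].sum()
df['prop_n_bad'] = df['n_bad'] / df['n_bad'].sum()

# WoE = ln(share of goods / share of bads) per category.
df['WoE'] = np.log(df['prop_n_good'] / df['prop_n_bad'])

# IV sums the WoE-weighted differences between the two distributions.
iv = ((df['prop_n_good'] - df['prop_n_bad']) * df['WoE']).sum()
print(round(iv, 3))  # → 0.858
```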
In [1131]:
# WoE function for discrete unordered variables
def woe_discrete(df, discrete_variable_name, good_bad_variable_df):

    # Concatenates two dataframes along the columns.
    df = pd.concat([df[discrete_variable_name], good_bad_variable_df], axis = 1)

    # Groups the data according to a criterion contained in one column.
    # Does not turn the values of the grouping criterion into indexes (as_index = False).
    # Aggregates the data in another column, using a selected function.
    # In this specific case, we group by the column with index 0 and we aggregate the values of the column with index 1.
    # More specifically, we count them.
    # In other words, we count the values in the column with index 1 for each value of the column with index 0.
    # Then we calculate the mean of the values in the column with index 1 for each value of the column with index 0.
    # And concatenate two dataframes along the columns.
    df = pd.concat([df.groupby(df.columns.values[0], as_index = False)[df.columns.values[1]].count(),
                    df.groupby(df.columns.values[0], as_index = False)[df.columns.values[1]].mean()], axis = 1)
   
    # Selects only columns with specific indexes.
    df = df.iloc[:, [0, 1, 3]]

    # Changes the names of the columns of a dataframe.
    df.columns = [df.columns.values[0], 'n_obs', 'prop_good']

    # We divide the values of one column by the values of another column and save the result in a new variable.
    df['prop_n_obs'] = df['n_obs'] / df['n_obs'].sum()

    # We multiply the values of one column by the values of another column and save the result in new variables.
    df['n_good'] = df['prop_good'] * df['n_obs']
    df['n_bad'] = (1 - df['prop_good']) * df['n_obs']

    # We calculate the proportion of good and the proportion of bad observations
    df['prop_n_good'] = df['n_good'] / df['n_good'].sum()
    df['prop_n_bad'] = df['n_bad'] / df['n_bad'].sum()

    # We take the natural logarithm of a variable and save the result in a new variable.
    # WoE = Weight of Evidence
    df['WoE'] = np.log(df['prop_n_good'] / df['prop_n_bad'])
    
    # Sorts a dataframe by the values of a given column.
    df = df.sort_values(['WoE'])

    # We reset the index of a dataframe and overwrite it.
    df = df.reset_index(drop = True)

    # We take the difference between two subsequent values of a column. Then, we take the absolute value of the result.
    df['diff_prop_good'] = df['prop_good'].diff().abs()

    # We take the difference between two subsequent values of a column. Then, we take the absolute value of the result.
    df['diff_WoE'] = df['WoE'].diff().abs()

    # We compute each group's contribution to the information value, then sum the contributions.
    df['IV'] = (df['prop_n_good'] - df['prop_n_bad']) * df['WoE']
    df['IV'] = df['IV'].sum()
    return df
# The function takes 3 arguments: a dataframe, a string, and a dataframe. The function returns a dataframe as a result.

Preprocessing Discrete Variables: Visualizing Results¶

In [1132]:
import seaborn as sns
# Imports the seaborn library.
sns.set()
# We set the default style of the graphs to the seaborn style.
In [1133]:
# Below we define a function that takes 2 arguments: a dataframe and a number.
# The number parameter has a default value of 0.
# This means that if we call the function and omit the number parameter, it will be executed with it having a value of 0.
# The function displays a graph.
def plot_by_woe(df_WoE, rotation_of_x_axis_labels = 0):
    x = np.array(df_WoE.iloc[:, 0].apply(str))
    # Turns the values of the column with index 0 to strings, makes an array from these strings, and passes it to variable x.
    y = df_WoE['WoE']
    # Selects a column with label 'WoE' and passes it to variable y.
    plt.figure(figsize=(18, 6))
    # Sets the graph size to width 18 x height 6.
    plt.plot(x, y, marker = 'o', linestyle = '--', color = 'k')
    # Plots the datapoints with coordinates variable x on the x-axis and variable y on the y-axis.
    # Sets the marker for each datapoint to a circle, the line style between the points to dashed, and the color to black.
    plt.xlabel(df_WoE.columns[0])
    # Names the x-axis with the name of the column with index 0.
    plt.ylabel('Weight of Evidence')
    # Names the y-axis 'Weight of Evidence'.
    plt.title(str('Weight of Evidence by ' + df_WoE.columns[0]))
    # Names the graph 'Weight of Evidence by ' plus the name of the column with index 0.
    plt.xticks(rotation = rotation_of_x_axis_labels)
    # Rotates the labels of the x-axis a predefined number of degrees.

List of the categorical variables in the dataset¶

In [1134]:
# Check the list of the categorical features of the dataset
categorical_vars = loan_data_clean.select_dtypes(include=['object', 'category']).columns.tolist()
print(categorical_vars)
print()
print('Number of categorical variables :   ',len(categorical_vars))
['id', 'term', 'grade', 'sub_grade', 'home_ownership', 'verification_status', 'issue_d', 'loan_status', 'pymnt_plan', 'purpose', 'title', 'zip_code', 'addr_state', 'earliest_cr_line', 'initial_list_status', 'next_pymnt_d', 'application_type', 'hardship_flag', 'disbursement_method', 'debt_settlement_flag']

Number of categorical variables :    20

Certain categorical variables should be dropped before training the credit risk model:¶

  • 'id': unique identifier – drop it entirely.
  • 'emp_title': very high cardinality – can be binned or dropped unless NLP is used.
  • 'title': free text similar to 'purpose' – often redundant/noisy.
  • 'zip_code': geographic info, but only the first 3 digits – may not generalize.
  • 'earliest_cr_line', 'earliest_cr_line_date': date-type, not categorical – the age of the credit history is extracted from them instead.
  • 'next_pymnt_d': future payment date – not useful for initial risk prediction.
  • 'issue_d', 'issue_d_date': loan issue date – converted to a numeric loan age.
  • 'pymnt_plan': 'n' (no) for nearly all loans – low variance, drop.
  • 'term': already converted to the numerical variable 'term_int'.
  • 'last_pymnt_d': date – converted to a numeric age.
  • 'last_credit_pull_d': date – converted to a numeric age.
  • 'loan_status': used only to define the target variable, never as a feature.
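Several of the bullets above rely on converting a date column to a numeric age in months. A minimal sketch of that conversion (the 'issue_d' values and the reference date are assumed here for illustration):

```python
import pandas as pd

# Hypothetical 'issue_d' values in Lending Club's 'Mon-YYYY' string format.
issue_d = pd.Series(['Dec-2015', 'Jan-2017', 'Jun-2018'], name='issue_d')

# Parse the strings to datetimes.
issue_d_date = pd.to_datetime(issue_d, format='%b-%Y')

# Express loan age in whole months relative to an assumed reference date.
reference = pd.Timestamp('2018-12-01')
mths_since_issue_d = ((reference.year - issue_d_date.dt.year) * 12
                      + (reference.month - issue_d_date.dt.month))
print(mths_since_issue_d.tolist())  # → [36, 23, 6]
```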
In [1135]:
list_col_to_drop = ['id', 'title', 'zip_code', 'earliest_cr_line', 'earliest_cr_line_date', 'next_pymnt_d', 'issue_d',
                    'issue_d_date', 'pymnt_plan',  'term', 'last_pymnt_d', 'last_credit_pull_d', 'loan_status']

df_inputs_prepr = df_inputs_prepr.drop(columns = list_col_to_drop)
In [1136]:
df_inputs_prepr.shape 
Out[1136]:
(274234, 105)

The final version of the training and test input datasets is composed of 105 features each.

In [1137]:
# Check the list of the categorical features of the dataset
categorical_vars_final = df_inputs_prepr.select_dtypes(include=['object', 'category']).columns.tolist()
print(categorical_vars_final)
['grade', 'sub_grade', 'home_ownership', 'verification_status', 'purpose', 'addr_state', 'initial_list_status', 'application_type', 'hardship_flag', 'disbursement_method', 'debt_settlement_flag']

There are 11 categorical features remaining.

In [1138]:
# Unique values of 'grade' feature
df_inputs_prepr['grade'].unique()
Out[1138]:
array(['C', 'D', 'A', 'E', 'F', 'B', 'G'], dtype=object)
In [1139]:
# Unique values of 'sub_grade' feature
df_inputs_prepr['sub_grade'].unique()
Out[1139]:
array(['C4', 'D2', 'C1', 'A5', 'E3', 'C3', 'C5', 'F4', 'E2', 'B2', 'A1',
       'C2', 'D3', 'B1', 'B4', 'E5', 'B5', 'D1', 'B3', 'A4', 'A3', 'E1',
       'D5', 'F5', 'F2', 'G4', 'A2', 'D4', 'F1', 'G1', 'G5', 'G2', 'G3',
       'F3', 'E4'], dtype=object)

Variable 'grade'¶

Avoid using both 'grade' and 'sub_grade' together:

  • Multicollinearity risk: grade is derived from sub_grade, so including both introduces strong correlation.
  • It adds noise rather than new signal.

We choose to keep the variable 'grade' and to drop the variable 'sub_grade'.
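That 'grade' is fully determined by 'sub_grade' can be verified with a quick check (toy rows assumed here, mimicking the dataset's structure):

```python
import pandas as pd

# Toy rows mimicking the grade / sub_grade relationship in the dataset.
df = pd.DataFrame({'grade':     ['A', 'A', 'B', 'B', 'C'],
                   'sub_grade': ['A1', 'A5', 'B2', 'B3', 'C4']})

# Every sub_grade maps to exactly one grade (its first character), so the
# two variables carry overlapping information.
print(df.groupby('sub_grade')['grade'].nunique().max())  # → 1
print((df['sub_grade'].str[0] == df['grade']).all())     # → True
```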

In [1140]:
# Drop the 'sub_grade' column.
df_inputs_prepr = df_inputs_prepr.drop(columns = ['sub_grade'])
In [1141]:
# 'grade'
df_temp = woe_discrete(df_inputs_prepr, 'grade', df_targets_prepr)
# We execute the function we defined with the necessary arguments: a dataframe, a string, and a dataframe.
# We store the result in a dataframe.
df_temp
Out[1141]:
grade n_obs prop_good prop_n_obs n_good n_bad prop_n_good prop_n_bad WoE diff_prop_good diff_WoE IV
0 A 47247 0.068428 0.172287 3233.0 44014.0 0.054981 0.204306 -1.312628 NaN NaN 0.437964
1 B 79943 0.146042 0.291514 11675.0 68268.0 0.198548 0.316889 -0.467522 0.077614 0.845106 0.437964
2 C 77916 0.241825 0.284122 18842.0 59074.0 0.320431 0.274212 0.155767 0.095783 0.623289 0.437964
3 D 41485 0.321827 0.151276 13351.0 28134.0 0.227050 0.130593 0.553082 0.080003 0.397315 0.437964
4 E 19227 0.402403 0.070112 7737.0 11490.0 0.131577 0.053335 0.903006 0.080576 0.349924 0.437964
5 F 6599 0.463252 0.024063 3057.0 3542.0 0.051988 0.016441 1.151212 0.060849 0.248206 0.437964
6 G 1817 0.499174 0.006626 907.0 910.0 0.015425 0.004224 1.295167 0.035922 0.143955 0.437964
In [1142]:
plot_by_woe(df_temp)
# We execute the function we defined with the necessary arguments: a dataframe.
# We omit the number argument, which means the function will use its default value, 0.
[Plot: Weight of Evidence by grade]
In [1143]:
df_var_dummies = [pd.get_dummies(df_inputs_prepr['grade'], prefix = 'grade', prefix_sep = ':')]
# We create dummy variables from original independent variables, and save them into a list.
# Note that we are using a particular naming convention for all variables: original variable name, colon, category name.

df_var_dummies = pd.concat(df_var_dummies, axis = 1)
# We concatenate the dummy variables and this turns them into a dataframe.

df_inputs_prepr = pd.concat([df_inputs_prepr, df_var_dummies], axis = 1)
# Concatenates two dataframes.
# Here we concatenate the dataframe with original data with the dataframe with dummy variables, along the columns. 
In [1144]:
df_inputs_prepr.head()
Out[1144]:
loan_amnt funded_amnt funded_amnt_inv int_rate installment grade home_ownership annual_inc verification_status purpose addr_state dti delinq_2yrs fico_range_low fico_range_high inq_last_6mths mths_since_last_delinq mths_since_last_record open_acc pub_rec revol_bal revol_util total_acc initial_list_status out_prncp out_prncp_inv total_pymnt total_pymnt_inv total_rec_prncp total_rec_int total_rec_late_fee recoveries collection_recovery_fee last_pymnt_amnt last_fico_range_high last_fico_range_low collections_12_mths_ex_med mths_since_last_major_derog policy_code application_type acc_now_delinq tot_coll_amt tot_cur_bal open_acc_6m open_act_il open_il_12m open_il_24m mths_since_rcnt_il total_bal_il il_util open_rv_12m open_rv_24m max_bal_bc all_util total_rev_hi_lim inq_fi total_cu_tl inq_last_12m acc_open_past_24mths avg_cur_bal bc_open_to_buy bc_util chargeoff_within_12_mths delinq_amnt mo_sin_old_il_acct mo_sin_old_rev_tl_op mo_sin_rcnt_rev_tl_op mo_sin_rcnt_tl mort_acc mths_since_recent_bc mths_since_recent_bc_dlq mths_since_recent_inq mths_since_recent_revol_delinq num_accts_ever_120_pd num_actv_bc_tl num_actv_rev_tl num_bc_sats num_bc_tl num_il_tl num_op_rev_tl num_rev_accts num_rev_tl_bal_gt_0 num_sats num_tl_120dpd_2m num_tl_30dpd num_tl_90g_dpd_24m num_tl_op_past_12m pct_tl_nvr_dlq percent_bc_gt_75 pub_rec_bankruptcies tax_liens tot_hi_cred_lim total_bal_ex_mort total_bc_limit total_il_high_credit_limit hardship_flag disbursement_method debt_settlement_flag emp_length_int term_int mths_since_issue_d mths_since_earliest_cr_line months_since_last_pymnt months_since_last_credit_pull grade:A grade:B grade:C grade:D grade:E grade:F grade:G
299291 12000.0 12000.0 12000.0 13.99 279.16 C OWN 30000.0 Source Verified debt_consolidation SD 25.32 0.0 675.0 679.0 1.0 76.0 999.0 19.0 0.0 11405.0 60.3 35.0 w 0.0 0.0 13667.840000 13667.84 12000.00 1667.84 0.0 0.00 0.0000 10615.73 574.0 570.0 0.0 999.0 1.0 Individual 0.0 0.0 88510.0 0.0 0.0 0.0 0.0 999.0 0.0 72.0 0.0 0.0 0.0 58.0 18900.0 0.0 0.0 0.0 4.0 4658.0 1221.0 83.3 0.0 0.0 127.0 121.0 15.0 5.0 0.0 36.0 76.0 5.0 76.0 0.0 6.0 11.0 6.0 14.0 7.0 14.0 28.0 11.0 19.0 0.0 0.0 0.0 1.0 97.1 66.7 0.0 0.0 97351.0 88510.0 7300.0 78451.0 N Cash N 7.0 60.0 68.0 198.0 57.0 22.0 False False True False False False False
2099335 35000.0 35000.0 35000.0 18.06 889.92 D RENT 140000.0 Source Verified debt_consolidation CA 20.49 0.0 695.0 699.0 2.0 999.0 999.0 12.0 0.0 30808.0 20.0 18.0 w 0.0 0.0 11582.320000 11582.32 3063.09 3986.04 0.0 4533.19 815.9742 889.92 574.0 570.0 0.0 999.0 1.0 Individual 0.0 0.0 91853.0 3.0 4.0 4.0 4.0 4.0 61045.0 87.0 2.0 5.0 11140.0 20.0 157000.0 0.0 2.0 4.0 9.0 7654.0 19625.0 20.0 0.0 0.0 93.0 72.0 3.0 3.0 0.0 3.0 999.0 3.0 999.0 0.0 8.0 8.0 8.0 8.0 9.0 8.0 9.0 8.0 12.0 0.0 0.0 0.0 6.0 100.0 0.0 0.0 0.0 227204.0 91853.0 157000.0 70204.0 N Cash N 2.0 60.0 38.0 132.0 30.0 24.0 False False False True False False False
113647 8400.0 8400.0 8400.0 12.29 280.17 C MORTGAGE 70495.0 Verified other GA 16.04 0.0 660.0 664.0 1.0 44.0 999.0 19.0 0.0 16940.0 94.1 39.0 w 0.0 0.0 10008.154816 10008.15 8400.00 1608.15 0.0 0.00 0.0000 196.31 709.0 705.0 0.0 45.0 1.0 Individual 0.0 79.0 145252.0 0.0 0.0 0.0 0.0 999.0 0.0 72.0 0.0 0.0 0.0 58.0 18000.0 0.0 0.0 0.0 8.0 7645.0 207.0 98.0 0.0 0.0 98.0 101.0 3.0 3.0 0.0 16.0 44.0 3.0 44.0 2.0 2.0 5.0 2.0 3.0 28.0 5.0 10.0 5.0 19.0 0.0 0.0 0.0 2.0 92.1 100.0 0.0 0.0 142744.0 145252.0 10500.0 124664.0 N Cash N 2.0 36.0 63.0 165.0 28.0 28.0 False False True False False False False
180785 18000.0 18000.0 18000.0 7.89 563.15 A RENT 68000.0 Not Verified debt_consolidation NV 11.12 0.0 705.0 709.0 0.0 81.0 999.0 6.0 0.0 7540.0 55.4 21.0 w 0.0 0.0 20235.797565 20235.80 18000.00 2235.80 0.0 0.00 0.0000 2230.78 714.0 710.0 0.0 999.0 1.0 Individual 0.0 0.0 33713.0 0.0 0.0 0.0 0.0 999.0 0.0 72.0 0.0 0.0 0.0 58.0 13600.0 0.0 0.0 0.0 2.0 5619.0 4353.0 63.1 0.0 0.0 94.0 96.0 9.0 9.0 0.0 9.0 81.0 21.0 81.0 0.0 3.0 4.0 3.0 5.0 14.0 4.0 7.0 4.0 6.0 0.0 0.0 0.0 1.0 94.7 33.3 0.0 0.0 46695.0 33713.0 11800.0 33095.0 N Cash N 0.0 36.0 65.0 162.0 32.0 27.0 True False False False False False False
1805875 12000.0 12000.0 12000.0 21.60 455.81 E RENT 165000.0 Not Verified credit_card NY 7.53 1.0 680.0 684.0 3.0 10.0 999.0 12.0 0.0 7883.0 47.2 25.0 f 0.0 0.0 13956.129946 13956.13 12000.00 1956.13 0.0 0.00 0.0000 9854.71 749.0 745.0 0.0 10.0 1.0 Individual 0.0 0.0 26652.0 0.0 0.0 0.0 0.0 999.0 0.0 72.0 0.0 0.0 0.0 58.0 16700.0 0.0 0.0 0.0 10.0 2221.0 327.0 85.1 0.0 0.0 22.0 162.0 2.0 2.0 0.0 2.0 34.0 0.0 34.0 1.0 2.0 7.0 2.0 11.0 2.0 9.0 22.0 7.0 12.0 0.0 0.0 1.0 5.0 86.0 100.0 0.0 0.0 42613.0 26652.0 2200.0 16500.0 N Cash N 2.0 36.0 88.0 251.0 79.0 22.0 False False False False True False False

Variable 'home_ownership'¶

In [1145]:
# 'home_ownership'
df_temp = woe_discrete(df_inputs_prepr, 'home_ownership', df_targets_prepr)
# We calculate weight of evidence.
df_temp
Out[1145]:
home_ownership n_obs prop_good prop_n_obs n_good n_bad prop_n_good prop_n_bad WoE diff_prop_good diff_WoE IV
0 NONE 12 0.166667 0.000044 2.0 10.0 0.000034 0.000046 -0.310968 NaN NaN 0.031033
1 MORTGAGE 135489 0.185329 0.494063 25110.0 110379.0 0.427026 0.512361 -0.182184 0.018662 0.128784 0.031033
2 OWN 29698 0.223012 0.108294 6623.0 23075.0 0.112632 0.107110 0.050268 0.037683 0.232452 0.031033
3 RENT 108950 0.248233 0.397288 27045.0 81905.0 0.459933 0.380190 0.190412 0.025221 0.140143 0.031033
4 OTHER 36 0.250000 0.000131 9.0 27.0 0.000153 0.000125 0.199857 0.001767 0.009446 0.031033
5 ANY 49 0.265306 0.000179 13.0 36.0 0.000221 0.000167 0.279900 0.015306 0.080043 0.031033
In [1146]:
plot_by_woe(df_temp)
# We plot the weight of evidence values.
[Plot: Weight of Evidence by home_ownership]
In [1147]:
df_var_dummies = [pd.get_dummies(df_inputs_prepr['home_ownership'], prefix = 'home_ownership', prefix_sep = ':')]
# We create dummy variables from original independent variables, and save them into a list.
# Note that we are using a particular naming convention for all variables: original variable name, colon, category name.

df_var_dummies = pd.concat(df_var_dummies, axis = 1)
# We concatenate the dummy variables and this turns them into a dataframe.
In [1148]:
# There are many categories with very few observations and many categories with very different "good" %.
# Therefore, we create a new discrete variable where we combine some of the categories.
# 'OTHER' and 'NONE' are riskiest but are very few. 'RENT' is the next riskiest.
# 'ANY' is least risky but too few. Conceptually, they belong to the same category, and their inclusion would not change anything.
# We combine them in one category, 'RENT_OTHER_NONE_ANY'.
# We end up with 3 categories: 'RENT_OTHER_NONE_ANY', 'OWN', 'MORTGAGE'.
df_var_dummies['home_ownership:RENT_OTHER_NONE_ANY'] = sum([df_var_dummies['home_ownership:RENT'], df_var_dummies['home_ownership:OTHER'],
                                                            df_var_dummies['home_ownership:NONE'], df_var_dummies['home_ownership:ANY']])
# 'RENT_OTHER_NONE_ANY' will be the reference category.
In [1149]:
df_var_dummies = df_var_dummies.drop(columns = ['home_ownership:RENT', 'home_ownership:OTHER', 'home_ownership:NONE', 'home_ownership:ANY'])
# Drop the dummy variables that are grouped together 

df_inputs_prepr = pd.concat([df_inputs_prepr, df_var_dummies], axis = 1)
# Concatenates two dataframes.
# Here we concatenate the dataframe with original data with the dataframe with dummy variables, along the columns. 

Variable 'addr_state'¶

In [1150]:
# 'addr_state'
df_inputs_prepr['addr_state'].unique()
Out[1150]:
array(['SD', 'CA', 'GA', 'NV', 'NY', 'PA', 'MD', 'MT', 'FL', 'CO', 'NJ',
       'CT', 'TX', 'WV', 'MA', 'NM', 'NC', 'IL', 'AZ', 'IN', 'MO', 'HI',
       'NE', 'NH', 'WA', 'KY', 'SC', 'TN', 'MN', 'LA', 'RI', 'VA', 'UT',
       'AL', 'ND', 'OH', 'MI', 'ID', 'KS', 'DE', 'OR', 'WY', 'AK', 'WI',
       'ME', 'AR', 'MS', 'OK', 'VT', 'DC', 'IA'], dtype=object)
In [1151]:
df_temp = woe_discrete(df_inputs_prepr, 'addr_state', df_targets_prepr)
# We calculate weight of evidence.
df_temp
C:\Users\pc\anaconda3\envs\envname\Lib\site-packages\pandas\core\arraylike.py:399: RuntimeWarning: divide by zero encountered in log
  result = getattr(ufunc, method)(*inputs, **kwargs)
Out[1151]:
addr_state n_obs prop_good prop_n_obs n_good n_bad prop_n_good prop_n_bad WoE diff_prop_good diff_WoE IV
0 IA 1 0.000000 0.000004 0.0 1.0 0.000000 0.000005 -inf NaN NaN inf
1 ME 395 0.121519 0.001440 48.0 347.0 0.000816 0.001611 -0.679654 0.121519 inf inf
2 NH 1293 0.146945 0.004715 190.0 1103.0 0.003231 0.005120 -0.460296 0.025426 0.219359 inf
3 DC 696 0.152299 0.002538 106.0 590.0 0.001803 0.002739 -0.418214 0.005354 0.042082 inf
4 OR 3271 0.153470 0.011928 502.0 2769.0 0.008537 0.012853 -0.409172 0.001171 0.009042 inf
5 VT 517 0.160542 0.001885 83.0 434.0 0.001412 0.002015 -0.355734 0.007072 0.053437 inf
6 KS 2261 0.163644 0.008245 370.0 1891.0 0.006292 0.008778 -0.332889 0.003103 0.022846 inf
7 CO 6119 0.165550 0.022313 1013.0 5106.0 0.017227 0.023701 -0.319031 0.001906 0.013858 inf
8 WY 601 0.166389 0.002192 100.0 501.0 0.001701 0.002326 -0.312966 0.000839 0.006064 inf
9 WV 1015 0.166502 0.003701 169.0 846.0 0.002874 0.003927 -0.312151 0.000113 0.000815 inf
10 RI 1226 0.177814 0.004471 218.0 1008.0 0.003707 0.004679 -0.232759 0.011312 0.079392 inf
11 WA 5989 0.182167 0.021839 1091.0 4898.0 0.018554 0.022736 -0.203263 0.004353 0.029496 inf
12 ND 318 0.182390 0.001160 58.0 260.0 0.000986 0.001207 -0.201769 0.000223 0.001494 inf
13 SC 3262 0.182403 0.011895 595.0 2667.0 0.010119 0.012380 -0.201679 0.000013 0.000091 inf
14 UT 2064 0.185078 0.007526 382.0 1682.0 0.006496 0.007808 -0.183849 0.002674 0.017830 inf
15 CT 4031 0.188787 0.014699 761.0 3270.0 0.012942 0.015179 -0.159442 0.003709 0.024406 inf
16 IL 10477 0.189367 0.038205 1984.0 8493.0 0.033740 0.039423 -0.155658 0.000580 0.003785 inf
17 WI 3544 0.189898 0.012923 673.0 2871.0 0.011445 0.013327 -0.152201 0.000531 0.003457 inf
18 MA 6381 0.200282 0.023268 1278.0 5103.0 0.021734 0.023687 -0.086063 0.010384 0.066138 inf
19 HI 1360 0.201471 0.004959 274.0 1086.0 0.004660 0.005041 -0.078659 0.001189 0.007404 inf
20 GA 8846 0.203595 0.032257 1801.0 7045.0 0.030628 0.032702 -0.065507 0.002124 0.013152 inf
21 DE 765 0.205229 0.002790 157.0 608.0 0.002670 0.002822 -0.055460 0.001634 0.010047 inf
22 MN 4862 0.205265 0.017729 998.0 3864.0 0.016972 0.017936 -0.055235 0.000037 0.000224 inf
23 MT 761 0.210250 0.002775 160.0 601.0 0.002721 0.002790 -0.024952 0.004984 0.030284 inf
24 SD 585 0.211966 0.002133 124.0 461.0 0.002109 0.002140 -0.014647 0.001716 0.010305 inf
25 CA 40010 0.213472 0.145897 8541.0 31469.0 0.145250 0.146074 -0.005655 0.001506 0.008992 inf
26 NC 7750 0.215097 0.028261 1667.0 6083.0 0.028349 0.028236 0.003997 0.001625 0.009652 inf
27 TX 22218 0.215276 0.081018 4783.0 17435.0 0.081341 0.080930 0.005058 0.000179 0.001061 inf
28 AZ 6583 0.215707 0.024005 1420.0 5163.0 0.024149 0.023966 0.007609 0.000431 0.002551 inf
29 AK 649 0.215716 0.002367 140.0 509.0 0.002381 0.002363 0.007664 0.000009 0.000055 inf
30 KY 2642 0.216124 0.009634 571.0 2071.0 0.009711 0.009613 0.010072 0.000408 0.002408 inf
31 IN 4508 0.218057 0.016439 983.0 3525.0 0.016717 0.016362 0.021443 0.001933 0.011371 inf
32 MO 4352 0.218061 0.015870 949.0 3403.0 0.016139 0.015796 0.021466 0.000004 0.000023 inf
33 MI 7295 0.218780 0.026601 1596.0 5699.0 0.027142 0.026454 0.025679 0.000719 0.004214 inf
34 OH 8954 0.220013 0.032651 1970.0 6984.0 0.033502 0.032419 0.032881 0.001233 0.007202 inf
35 PA 9237 0.220526 0.033683 2037.0 7200.0 0.034642 0.033421 0.035867 0.000513 0.002985 inf
36 VA 7728 0.223732 0.028180 1729.0 5999.0 0.029404 0.027846 0.054420 0.003206 0.018553 inf
37 TN 4096 0.225342 0.014936 923.0 3173.0 0.015697 0.014729 0.063666 0.001610 0.009246 inf
38 FL 19831 0.227220 0.072314 4506.0 15325.0 0.076630 0.071136 0.074394 0.001878 0.010728 inf
39 NJ 9932 0.227547 0.036217 2260.0 7672.0 0.038434 0.035612 0.076257 0.000327 0.001863 inf
40 NM 1464 0.228142 0.005339 334.0 1130.0 0.005680 0.005245 0.079638 0.000595 0.003381 inf
41 NV 4091 0.230017 0.014918 941.0 3150.0 0.016003 0.014622 0.090255 0.001875 0.010617 inf
42 MD 6357 0.236118 0.023181 1501.0 4856.0 0.025526 0.022541 0.124386 0.006101 0.034131 inf
43 NY 22427 0.238329 0.081781 5345.0 17082.0 0.090898 0.079292 0.136606 0.002211 0.012220 inf
44 AL 3323 0.242552 0.012117 806.0 2517.0 0.013707 0.011684 0.159730 0.004223 0.023124 inf
45 LA 3183 0.251021 0.011607 799.0 2384.0 0.013588 0.011066 0.205295 0.008469 0.045565 inf
46 ID 307 0.254072 0.001119 78.0 229.0 0.001326 0.001063 0.221456 0.003051 0.016161 inf
47 OK 2590 0.264479 0.009444 685.0 1905.0 0.011649 0.008843 0.275651 0.010407 0.054195 inf
48 AR 2022 0.266568 0.007373 539.0 1483.0 0.009166 0.006884 0.286363 0.002089 0.010712 inf
49 NE 741 0.271255 0.002702 201.0 540.0 0.003418 0.002507 0.310205 0.004687 0.023843 inf
50 MS 1304 0.278374 0.004755 363.0 941.0 0.006173 0.004368 0.345929 0.007119 0.035724 inf
In [1152]:
plot_by_woe(df_temp)
# We plot the weight of evidence values.
[Plot: Weight of Evidence by addr_state]
In [1153]:
plot_by_woe(df_temp.iloc[2: -2, : ])
# We plot the weight of evidence values, excluding the first and last two categories (extreme WoE, few observations).
[Plot: Weight of Evidence by addr_state, first and last two categories excluded]
In [1154]:
plot_by_woe(df_temp.iloc[6: -6, : ])
# We plot the weight of evidence values, excluding the first and last six categories (extreme WoE, few observations).
[Plot: Weight of Evidence by addr_state, first and last six categories excluded]
In [1155]:
df_var_dummies = [pd.get_dummies(df_inputs_prepr['addr_state'], prefix = 'addr_state', prefix_sep = ':')]
# We create dummy variables from original independent variables, and save them into a list.
# Note that we are using a particular naming convention for all variables: original variable name, colon, category name.

df_var_dummies = pd.concat(df_var_dummies, axis = 1)
# We concatenate the dummy variables and this turns them into a dataframe.
In [1156]:
# We create the following categories:
# 'ND' 'NE' 'IA' NV' 'FL' 'HI' 'AL'
# 'NM' 'VA'
# 'NY'
# 'OK' 'TN' 'MO' 'LA' 'MD' 'NC'
# 'CA'
# 'UT' 'KY' 'AZ' 'NJ'
# 'AR' 'MI' 'PA' 'OH' 'MN'
# 'RI' 'MA' 'DE' 'SD' 'IN'
# 'GA' 'WA' 'OR'
# 'WI' 'MT'
# 'TX'
# 'IL' 'CT'
# 'KS' 'SC' 'CO' 'VT' 'AK' 'MS'
# 'WV' 'NH' 'WY' 'DC' 'ME' 'ID'

# 'addr_state:ND_NE_IA_NV_FL_HI_AL' will be the reference category.

df_inputs_prepr['addr_state:ND_NE_IA_NV_FL_HI_AL'] = sum([df_var_dummies['addr_state:ND'], df_var_dummies['addr_state:NE'],
                                              df_var_dummies['addr_state:IA'], df_var_dummies['addr_state:NV'],
                                              df_var_dummies['addr_state:FL'], df_var_dummies['addr_state:HI'],
                                                          df_var_dummies['addr_state:AL']])

df_inputs_prepr['addr_state:NM_VA'] = sum([df_var_dummies['addr_state:NM'], df_var_dummies['addr_state:VA']])

df_inputs_prepr['addr_state:OK_TN_MO_LA_MD_NC'] = sum([df_var_dummies['addr_state:OK'], df_var_dummies['addr_state:TN'],
                                                       df_var_dummies['addr_state:MO'], df_var_dummies['addr_state:LA'],
                                                       df_var_dummies['addr_state:MD'], df_var_dummies['addr_state:NC']])

df_inputs_prepr['addr_state:UT_KY_AZ_NJ'] = sum([df_var_dummies['addr_state:UT'], df_var_dummies['addr_state:KY'],
                                                 df_var_dummies['addr_state:AZ'], df_var_dummies['addr_state:NJ']])

df_inputs_prepr['addr_state:AR_MI_PA_OH_MN'] = sum([df_var_dummies['addr_state:AR'], df_var_dummies['addr_state:MI'],
                                                    df_var_dummies['addr_state:PA'], df_var_dummies['addr_state:OH'],
                                                    df_var_dummies['addr_state:MN']])

df_inputs_prepr['addr_state:RI_MA_DE_SD_IN'] = sum([df_var_dummies['addr_state:RI'], df_var_dummies['addr_state:MA'],
                                                    df_var_dummies['addr_state:DE'], df_var_dummies['addr_state:SD'],
                                                    df_var_dummies['addr_state:IN']])

df_inputs_prepr['addr_state:GA_WA_OR'] = sum([df_var_dummies['addr_state:GA'], df_var_dummies['addr_state:WA'],
                                              df_var_dummies['addr_state:OR']])

df_inputs_prepr['addr_state:WI_MT'] = sum([df_var_dummies['addr_state:WI'], df_var_dummies['addr_state:MT']])

df_inputs_prepr['addr_state:IL_CT'] = sum([df_var_dummies['addr_state:IL'], df_var_dummies['addr_state:CT']])

df_inputs_prepr['addr_state:KS_SC_CO_VT_AK_MS'] = sum([df_var_dummies['addr_state:KS'], df_var_dummies['addr_state:SC'],
                                                       df_var_dummies['addr_state:CO'], df_var_dummies['addr_state:VT'],
                                                       df_var_dummies['addr_state:AK'], df_var_dummies['addr_state:MS']])

df_inputs_prepr['addr_state:WV_NH_WY_DC_ME_ID'] = sum([df_var_dummies['addr_state:WV'], df_var_dummies['addr_state:NH'],
                                                       df_var_dummies['addr_state:WY'], df_var_dummies['addr_state:DC'],
                                                       df_var_dummies['addr_state:ME'], df_var_dummies['addr_state:ID']])

Variable 'verification_status'¶

In [1157]:
# 'verification_status'
df_temp = woe_discrete(df_inputs_prepr, 'verification_status', df_targets_prepr)
# We calculate weight of evidence.
df_temp
Out[1157]:
verification_status n_obs prop_good prop_n_obs n_good n_bad prop_n_good prop_n_bad WoE diff_prop_good diff_WoE IV
0 Not Verified 82536 0.160887 0.300969 13279.0 69257.0 0.225826 0.321480 -0.353171 NaN NaN 0.050456
1 Source Verified 106486 0.225344 0.388303 23996.0 82490.0 0.408081 0.382905 0.063680 0.064457 0.416850 0.050456
2 Verified 85212 0.252629 0.310727 21527.0 63685.0 0.366093 0.295615 0.213828 0.027285 0.150149 0.050456
In [1158]:
plot_by_woe(df_temp)
# We plot the weight of evidence values.
[Plot: Weight of Evidence by verification_status]
In [1159]:
df_var_dummies = [pd.get_dummies(df_inputs_prepr['verification_status'], prefix = 'verification_status', prefix_sep = ':')]
# We create dummy variables from original independent variables, and save them into a list.
# Note that we are using a particular naming convention for all variables: original variable name, colon, category name.

df_var_dummies = pd.concat(df_var_dummies, axis = 1)
# We concatenate the dummy variables and this turns them into a dataframe.
In [1160]:
df_inputs_prepr = pd.concat([df_inputs_prepr, df_var_dummies], axis = 1)
# Concatenates two dataframes.
# Here we concatenate the dataframe with original data with the dataframe with dummy variables, along the columns. 

Variable 'purpose'¶

In [1161]:
# 'purpose'
df_temp = woe_discrete(df_inputs_prepr, 'purpose', df_targets_prepr)
# We calculate weight of evidence.
df_temp
Out[1161]:
purpose n_obs prop_good prop_n_obs n_good n_bad prop_n_good prop_n_bad WoE diff_prop_good diff_WoE IV
0 wedding 456 0.133772 0.001663 61.0 395.0 0.001037 0.001834 -0.569542 NaN NaN 0.019054
1 car 2922 0.148186 0.010655 433.0 2489.0 0.007364 0.011554 -0.450429 0.014414 0.119113 0.019054
2 educational 66 0.181818 0.000241 12.0 54.0 0.000204 0.000251 -0.205608 0.033632 0.244821 0.019054
3 credit_card 59492 0.182024 0.216939 10829.0 48663.0 0.184160 0.225886 -0.204222 0.000206 0.001386 0.019054
4 home_improvement 17962 0.195246 0.065499 3507.0 14455.0 0.059641 0.067098 -0.117810 0.013221 0.086412 0.019054
5 major_purchase 6053 0.198744 0.022072 1203.0 4850.0 0.020458 0.022513 -0.095691 0.003499 0.022119 0.019054
6 vacation 1864 0.217275 0.006797 405.0 1459.0 0.006888 0.006772 0.016850 0.018530 0.112541 0.019054
7 debt_consolidation 159235 0.226244 0.580654 36026.0 123209.0 0.612666 0.571916 0.068828 0.008970 0.051978 0.019054
8 other 16206 0.228372 0.059096 3701.0 12505.0 0.062940 0.058046 0.080944 0.002128 0.012116 0.019054
9 moving 2014 0.237339 0.007344 478.0 1536.0 0.008129 0.007130 0.131143 0.008966 0.050199 0.019054
10 medical 3246 0.240604 0.011837 781.0 2465.0 0.013282 0.011442 0.149098 0.003265 0.017954 0.019054
11 house 1477 0.247800 0.005386 366.0 1111.0 0.006224 0.005157 0.188087 0.007196 0.038989 0.019054
12 renewable_energy 173 0.254335 0.000631 44.0 129.0 0.000748 0.000599 0.222847 0.006536 0.034760 0.019054
13 small_business 3068 0.311604 0.011188 956.0 2112.0 0.016258 0.009804 0.505837 0.057268 0.282990 0.019054
In [1162]:
plot_by_woe(df_temp, 90)
# We plot the weight of evidence values.
[Plot: Weight of Evidence by purpose]
In [1163]:
df_var_dummies = [pd.get_dummies(df_inputs_prepr['purpose'], prefix = 'purpose', prefix_sep = ':')]
# We create dummy variables from original independent variables, and save them into a list.
# Note that we are using a particular naming convention for all variables: original variable name, colon, category name.

df_var_dummies = pd.concat(df_var_dummies, axis = 1)
# We concatenate the dummy variables and this turns them into a dataframe.
In [1164]:
# We combine 'small_business', 'moving', 'renewable_energy', 'house' and 'medical' in one category: 'sm_b__mov__ren_en__house__medic'.
# We combine 'other', 'vacation' and 'major_purchase' in one category: 'other__vacat__maj_purch'.
# We combine 'home_improvement', 'educational', 'car' and 'wedding' in one category: 'home_impr__educ__car__wed'.
# We leave 'debt_consolidation' in a separate category.
# We leave 'credit_card' in a separate category.
#'sm_b__mov__ren_en__house__medic' will be the reference category.
df_inputs_prepr['purpose:debt_consolidation'] = df_var_dummies['purpose:debt_consolidation']
df_inputs_prepr['purpose:credit_card'] = df_var_dummies['purpose:credit_card']
df_inputs_prepr['purpose:sm_b__mov__ren_en__house__medic'] = sum([df_var_dummies['purpose:small_business'], df_var_dummies['purpose:moving'],
                                                                  df_var_dummies['purpose:renewable_energy'],df_var_dummies['purpose:house'],
                                                                  df_var_dummies['purpose:medical']])
df_inputs_prepr['purpose:other__vacat__maj_purch'] = sum([df_var_dummies['purpose:other'], df_var_dummies['purpose:major_purchase'],
                                                          df_var_dummies['purpose:vacation']])
df_inputs_prepr['purpose:home_impr__educ__car__wed'] = sum([df_var_dummies['purpose:home_improvement'], df_var_dummies['purpose:educational'],
                                                              df_var_dummies['purpose:car'],  df_var_dummies['purpose:wedding']])

Variables: 'initial_list_status', 'application_type', 'hardship_flag', 'disbursement_method', 'debt_settlement_flag'¶

Each of these variables has only 2 unique categories.

In [1165]:
loan_data_dummies = [pd.get_dummies(df_inputs_prepr['initial_list_status'], prefix = 'initial_list_status', prefix_sep = ':'),
                     pd.get_dummies(df_inputs_prepr['application_type'], prefix = 'application_type', prefix_sep = ':'),
                     pd.get_dummies(df_inputs_prepr['hardship_flag'], prefix = 'hardship_flag', prefix_sep = ':'),
                     pd.get_dummies(df_inputs_prepr['disbursement_method'], prefix = 'disbursement_method', prefix_sep = ':'),
                     pd.get_dummies(df_inputs_prepr['debt_settlement_flag'], prefix = 'debt_settlement_flag', prefix_sep = ':')]
# We create dummy variables from all these original independent variables, and save them into a list.
# Note that we are using a particular naming convention for all variables: original variable name, colon, category name.
In [1166]:
loan_data_dummies = pd.concat(loan_data_dummies, axis = 1)
# We concatenate the dummy variables and this turns them into a dataframe.

df_inputs_prepr = pd.concat([df_inputs_prepr, loan_data_dummies], axis = 1)
# Concatenates two dataframes.
# Here we concatenate the dataframe with original data with the dataframe with dummy variables, along the columns. 

Convert all True and False values to 1 and 0¶

In [1167]:
df_inputs_prepr = df_inputs_prepr.replace({True: 1, False: 0})
C:\Users\pc\AppData\Local\Temp\ipykernel_6728\2025257523.py:1: FutureWarning: Downcasting behavior in `replace` is deprecated and will be removed in a future version. To retain the old behavior, explicitly call `result.infer_objects(copy=False)`. To opt-in to the future behavior, set `pd.set_option('future.no_silent_downcasting', True)`
  df_inputs_prepr = df_inputs_prepr.replace({True: 1, False: 0})
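The FutureWarning above flags deprecated silent downcasting in `replace`. A minimal warning-free alternative, sketched on a toy frame with hypothetical column names, is to cast the purely boolean columns explicitly:

```python
import pandas as pd

# Toy frame standing in for the boolean dummy columns (hypothetical names).
df = pd.DataFrame({"flag:A": [True, False, True], "flag:B": [False, True, False]})

# For columns that are purely boolean, an explicit cast avoids the
# FutureWarning raised by replace({True: 1, False: 0}):
bool_cols = df.select_dtypes(include='bool').columns
df[bool_cols] = df[bool_cols].astype(int)
```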
In [1168]:
df_inputs_prepr.head()
Out[1168]:
loan_amnt funded_amnt funded_amnt_inv int_rate installment grade home_ownership annual_inc verification_status purpose addr_state dti delinq_2yrs fico_range_low fico_range_high inq_last_6mths mths_since_last_delinq mths_since_last_record open_acc pub_rec revol_bal revol_util total_acc initial_list_status out_prncp out_prncp_inv total_pymnt total_pymnt_inv total_rec_prncp total_rec_int total_rec_late_fee recoveries collection_recovery_fee last_pymnt_amnt last_fico_range_high last_fico_range_low collections_12_mths_ex_med mths_since_last_major_derog policy_code application_type acc_now_delinq tot_coll_amt tot_cur_bal open_acc_6m open_act_il open_il_12m open_il_24m mths_since_rcnt_il total_bal_il il_util open_rv_12m open_rv_24m max_bal_bc all_util total_rev_hi_lim inq_fi total_cu_tl inq_last_12m acc_open_past_24mths avg_cur_bal bc_open_to_buy bc_util chargeoff_within_12_mths delinq_amnt mo_sin_old_il_acct mo_sin_old_rev_tl_op mo_sin_rcnt_rev_tl_op mo_sin_rcnt_tl mort_acc mths_since_recent_bc mths_since_recent_bc_dlq mths_since_recent_inq mths_since_recent_revol_delinq num_accts_ever_120_pd num_actv_bc_tl num_actv_rev_tl num_bc_sats num_bc_tl num_il_tl num_op_rev_tl num_rev_accts num_rev_tl_bal_gt_0 num_sats num_tl_120dpd_2m num_tl_30dpd num_tl_90g_dpd_24m num_tl_op_past_12m pct_tl_nvr_dlq percent_bc_gt_75 pub_rec_bankruptcies tax_liens tot_hi_cred_lim total_bal_ex_mort total_bc_limit total_il_high_credit_limit hardship_flag disbursement_method debt_settlement_flag emp_length_int term_int mths_since_issue_d mths_since_earliest_cr_line months_since_last_pymnt months_since_last_credit_pull grade:A grade:B grade:C grade:D grade:E grade:F grade:G home_ownership:MORTGAGE home_ownership:OWN home_ownership:RENT_OTHER_NONE_ANY addr_state:ND_NE_IA_NV_FL_HI_AL addr_state:NM_VA addr_state:OK_TN_MO_LA_MD_NC addr_state:UT_KY_AZ_NJ addr_state:AR_MI_PA_OH_MN addr_state:RI_MA_DE_SD_IN addr_state:GA_WA_OR addr_state:WI_MT addr_state:IL_CT addr_state:KS_SC_CO_VT_AK_MS 
addr_state:WV_NH_WY_DC_ME_ID verification_status:Not Verified verification_status:Source Verified verification_status:Verified purpose:debt_consolidation purpose:credit_card purpose:sm_b__mov__ren_en__house__medic purpose:other__vacat__maj_purch purpose:home_impr__educ__car__wed initial_list_status:f initial_list_status:w application_type:Individual application_type:Joint App hardship_flag:N hardship_flag:Y disbursement_method:Cash disbursement_method:DirectPay debt_settlement_flag:N debt_settlement_flag:Y
299291 12000.0 12000.0 12000.0 13.99 279.16 C OWN 30000.0 Source Verified debt_consolidation SD 25.32 0.0 675.0 679.0 1.0 76.0 999.0 19.0 0.0 11405.0 60.3 35.0 w 0.0 0.0 13667.840000 13667.84 12000.00 1667.84 0.0 0.00 0.0000 10615.73 574.0 570.0 0.0 999.0 1.0 Individual 0.0 0.0 88510.0 0.0 0.0 0.0 0.0 999.0 0.0 72.0 0.0 0.0 0.0 58.0 18900.0 0.0 0.0 0.0 4.0 4658.0 1221.0 83.3 0.0 0.0 127.0 121.0 15.0 5.0 0.0 36.0 76.0 5.0 76.0 0.0 6.0 11.0 6.0 14.0 7.0 14.0 28.0 11.0 19.0 0.0 0.0 0.0 1.0 97.1 66.7 0.0 0.0 97351.0 88510.0 7300.0 78451.0 N Cash N 7.0 60.0 68.0 198.0 57.0 22.0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 1 0 0 0 0 0 1 1 0 1 0 1 0 1 0
2099335 35000.0 35000.0 35000.0 18.06 889.92 D RENT 140000.0 Source Verified debt_consolidation CA 20.49 0.0 695.0 699.0 2.0 999.0 999.0 12.0 0.0 30808.0 20.0 18.0 w 0.0 0.0 11582.320000 11582.32 3063.09 3986.04 0.0 4533.19 815.9742 889.92 574.0 570.0 0.0 999.0 1.0 Individual 0.0 0.0 91853.0 3.0 4.0 4.0 4.0 4.0 61045.0 87.0 2.0 5.0 11140.0 20.0 157000.0 0.0 2.0 4.0 9.0 7654.0 19625.0 20.0 0.0 0.0 93.0 72.0 3.0 3.0 0.0 3.0 999.0 3.0 999.0 0.0 8.0 8.0 8.0 8.0 9.0 8.0 9.0 8.0 12.0 0.0 0.0 0.0 6.0 100.0 0.0 0.0 0.0 227204.0 91853.0 157000.0 70204.0 N Cash N 2.0 60.0 38.0 132.0 30.0 24.0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 1 1 0 1 0 1 0 1 0
113647 8400.0 8400.0 8400.0 12.29 280.17 C MORTGAGE 70495.0 Verified other GA 16.04 0.0 660.0 664.0 1.0 44.0 999.0 19.0 0.0 16940.0 94.1 39.0 w 0.0 0.0 10008.154816 10008.15 8400.00 1608.15 0.0 0.00 0.0000 196.31 709.0 705.0 0.0 45.0 1.0 Individual 0.0 79.0 145252.0 0.0 0.0 0.0 0.0 999.0 0.0 72.0 0.0 0.0 0.0 58.0 18000.0 0.0 0.0 0.0 8.0 7645.0 207.0 98.0 0.0 0.0 98.0 101.0 3.0 3.0 0.0 16.0 44.0 3.0 44.0 2.0 2.0 5.0 2.0 3.0 28.0 5.0 10.0 5.0 19.0 0.0 0.0 0.0 2.0 92.1 100.0 0.0 0.0 142744.0 145252.0 10500.0 124664.0 N Cash N 2.0 36.0 63.0 165.0 28.0 28.0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 1 0 0 1 1 0 1 0 1 0 1 0
180785 18000.0 18000.0 18000.0 7.89 563.15 A RENT 68000.0 Not Verified debt_consolidation NV 11.12 0.0 705.0 709.0 0.0 81.0 999.0 6.0 0.0 7540.0 55.4 21.0 w 0.0 0.0 20235.797565 20235.80 18000.00 2235.80 0.0 0.00 0.0000 2230.78 714.0 710.0 0.0 999.0 1.0 Individual 0.0 0.0 33713.0 0.0 0.0 0.0 0.0 999.0 0.0 72.0 0.0 0.0 0.0 58.0 13600.0 0.0 0.0 0.0 2.0 5619.0 4353.0 63.1 0.0 0.0 94.0 96.0 9.0 9.0 0.0 9.0 81.0 21.0 81.0 0.0 3.0 4.0 3.0 5.0 14.0 4.0 7.0 4.0 6.0 0.0 0.0 0.0 1.0 94.7 33.3 0.0 0.0 46695.0 33713.0 11800.0 33095.0 N Cash N 0.0 36.0 65.0 162.0 32.0 27.0 1 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 1 1 0 1 0 1 0 1 0
1805875 12000.0 12000.0 12000.0 21.60 455.81 E RENT 165000.0 Not Verified credit_card NY 7.53 1.0 680.0 684.0 3.0 10.0 999.0 12.0 0.0 7883.0 47.2 25.0 f 0.0 0.0 13956.129946 13956.13 12000.00 1956.13 0.0 0.00 0.0000 9854.71 749.0 745.0 0.0 10.0 1.0 Individual 0.0 0.0 26652.0 0.0 0.0 0.0 0.0 999.0 0.0 72.0 0.0 0.0 0.0 58.0 16700.0 0.0 0.0 0.0 10.0 2221.0 327.0 85.1 0.0 0.0 22.0 162.0 2.0 2.0 0.0 2.0 34.0 0.0 34.0 1.0 2.0 7.0 2.0 11.0 2.0 9.0 22.0 7.0 12.0 0.0 0.0 1.0 5.0 86.0 100.0 0.0 0.0 42613.0 26652.0 2200.0 16500.0 N Cash N 2.0 36.0 88.0 251.0 79.0 22.0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 1 0 1 0 1 0 1 0 1 0

B. Preprocessing Continuous Variables¶

Check the list of numerical features¶

  • Features with more than 10 unique values are considered continuous.
  • Features with 10 or fewer unique values are considered discrete.
In [1169]:
# Step 1: Get numerical columns
num_cols = loan_data_inputs_train.select_dtypes(include=['float64', 'int64']).columns

# Step 2: Filter continuous features (optional rule of thumb: more than 10 unique values)
continuous_features = [col for col in num_cols if loan_data_inputs_train[col].nunique() > 10]

print("Continuous features:")
print(continuous_features)
print()
print('Number of continuous features:   ', len(continuous_features))
Continuous features:
['min_mths_since_delinquency']

Number of continuous features:    1
In [1170]:
# Get numerical features with 10 or fewer unique values (these can be treated as discrete features)
Other_features = [col for col in num_cols if loan_data_inputs_train[col].nunique() <= 10]

print("Numerical discrete features:")
print(Other_features)
print()
print('Number of numerical discrete features:   ', len(Other_features))
Numerical discrete features:
['grade:A', 'grade:B', 'grade:C', 'grade:D', 'grade:E', 'grade:F', 'grade:G', 'home_ownership:MORTGAGE', 'home_ownership:OWN', 'verification_status:Not Verified', 'verification_status:Source Verified', 'verification_status:Verified', 'purpose:debt_consolidation', 'purpose:credit_card', 'initial_list_status:f', 'initial_list_status:w', 'application_type:Individual', 'application_type:Joint App', 'hardship_flag:N', 'hardship_flag:Y', 'disbursement_method:Cash', 'disbursement_method:DirectPay', 'debt_settlement_flag:N', 'debt_settlement_flag:Y']

Number of numerical discrete features:    24

The raw numerical inputs, before encoding and transformation, comprise 88 continuous features and 6 discrete ones; these are organized below into feature families for the multicollinearity analysis.

To check for multicollinearity, the most common approach is to compute the correlation matrix and/or the Variance Inflation Factor (VIF) for each feature.¶

A proposed classification of the numerical features into distinct feature families, based on their meaning and purpose in credit risk modeling:

1. Loan and Funding Information

These features relate to the loan’s original terms and funding:

  • loan_amnt
  • funded_amnt
  • funded_amnt_inv
  • term_int
  • int_rate
  • installment

2. Applicant Financial Profile

Measures of income, employment duration.

  • annual_inc
  • emp_length_int
  • dti

3. Credit History and Delinquency

How the borrower has paid (or missed) obligations.

  • delinq_2yrs
  • mths_since_last_delinq
  • mths_since_last_record
  • collections_12_mths_ex_med
  • mths_since_last_major_derog
  • chargeoff_within_12_mths
  • delinq_amnt
  • acc_now_delinq
  • num_accts_ever_120_pd
  • num_tl_90g_dpd_24m
  • num_tl_120dpd_2m
  • num_tl_30dpd

4. Credit Utilization & Balance

Measures how much of available credit is being used:

  • revol_bal
  • revol_util
  • il_util
  • all_util
  • bc_util
  • bc_open_to_buy
  • total_bal_ex_mort
  • total_bal_il
  • max_bal_bc
  • avg_cur_bal
  • total_bc_limit

5. Credit Limits

Total credit available across various types:

  • tot_hi_cred_lim
  • total_rev_hi_lim
  • total_il_high_credit_limit
  • tot_cur_bal

6. Credit Account Status

These features indicate the number and types of open/active accounts:

  • open_acc
  • total_acc
  • open_acc_6m
  • open_act_il
  • open_il_12m
  • open_il_24m
  • open_rv_12m
  • open_rv_24m
  • num_sats
  • num_il_tl
  • num_rev_accts
  • num_actv_bc_tl
  • num_actv_rev_tl
  • num_bc_sats
  • num_bc_tl
  • num_op_rev_tl
  • num_rev_tl_bal_gt_0
  • acc_open_past_24mths
  • total_cu_tl

7. Credit Inquiries

Indicators of recent credit-seeking activity:

  • inq_fi
  • inq_last_6mths
  • inq_last_12m
  • mths_since_recent_inq

8. Payment and Recovery

These reflect payments and recovery-related metrics:

  • out_prncp
  • out_prncp_inv
  • total_pymnt
  • total_pymnt_inv
  • total_rec_prncp
  • total_rec_int
  • total_rec_late_fee
  • recoveries
  • collection_recovery_fee
  • last_pymnt_amnt

9. FICO Scores

Borrower’s credit score ranges:

  • fico_range_low
  • fico_range_high
  • last_fico_range_high
  • last_fico_range_low

10. Credit Line & History Timelines

Tracks age or recency of credit lines:

  • mths_since_earliest_cr_line
  • mths_since_issue_d
  • mo_sin_old_il_acct
  • mo_sin_old_rev_tl_op
  • mo_sin_rcnt_rev_tl_op
  • mo_sin_rcnt_tl
  • mths_since_rcnt_il
  • mths_since_recent_bc
  • mths_since_recent_bc_dlq
  • mths_since_recent_revol_delinq

11. Other / Miscellaneous

  • percent_bc_gt_75
  • pct_tl_nvr_dlq
  • tax_liens
  • pub_rec
  • pub_rec_bankruptcies
  • tot_coll_amt
  • mort_acc
  • months_since_last_pymnt
  • months_since_last_credit_pull
  • policy_code
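For bookkeeping, this classification can be kept in a plain dictionary so each family can be analyzed in turn (a convenience sketch; only a few families are spelled out, with names mirroring the lists above):

```python
# Map each feature family (as enumerated above) to its column names;
# only a few families are shown here for brevity.
feature_families = {
    'loan_info': ['loan_amnt', 'funded_amnt', 'funded_amnt_inv', 'term_int', 'int_rate', 'installment'],
    'applicant_profile': ['annual_inc', 'emp_length_int', 'dti'],
    'credit_inquiries': ['inq_fi', 'inq_last_6mths', 'inq_last_12m', 'mths_since_recent_inq'],
    'fico_scores': ['fico_range_low', 'fico_range_high', 'last_fico_range_high', 'last_fico_range_low'],
}

# Each family can then be fed to the VIF helper one at a time, e.g.:
# for name, cols in feature_families.items():
#     print(name)
#     print(calculate_vif(df_inputs_prepr, cols))
```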

Function that calculates the Variance Inflation Factor (VIF) for a given list of numerical features:¶

In [1171]:
from statsmodels.stats.outliers_influence import variance_inflation_factor

def calculate_vif(data, num_features, sample_size=10000, random_state=42):
    """
    Calculates the Variance Inflation Factor (VIF) for the specified numerical features.

    Parameters:
    - data (pd.DataFrame): The full DataFrame.
    - num_features (list): List of numerical features to include in the VIF calculation.
    - sample_size (int): Number of rows to sample for computation.
    - random_state (int): Seed for reproducibility.

    Returns:
    - pd.DataFrame: Sorted DataFrame with features and their VIF values.
    """
    # Drop rows with missing values in the selected features
    X = data[num_features].dropna()

    # Sample the data (capped at the number of available rows) to speed up the VIF calculation
    X_sample = X.sample(n=min(sample_size, len(X)), random_state=random_state)

    # Calculate VIF for each feature
    vif_data = pd.DataFrame()
    vif_data["feature"] = X_sample.columns
    vif_data["VIF"] = [variance_inflation_factor(X_sample.values, i) for i in range(X_sample.shape[1])]

    # Sort by VIF in descending order
    vif_data = vif_data.sort_values(by='VIF', ascending=False).reset_index(drop=True)
    
    return vif_data
  • VIF > 10 suggests serious multicollinearity, meaning the feature is highly predictable from other features and could distort model interpretation and stability.

  • VIF between 5–10 indicates moderate correlation.

  • VIF < 5 is generally considered acceptable.
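These thresholds follow from the definition VIF_i = 1 / (1 - R²_i), where R²_i comes from regressing feature i on the remaining features. A numpy-only toy sketch on synthetic data (made-up variables, not the loan dataset) illustrates the interpretation:

```python
import numpy as np

# Synthetic illustration: x2 is built from x1, x3 is independent noise.
rng = np.random.default_rng(0)
x1 = rng.normal(size=1000)
x2 = 0.95 * x1 + rng.normal(scale=0.3, size=1000)   # strongly collinear with x1
x3 = rng.normal(size=1000)                          # unrelated noise
X = np.column_stack([x1, x2, x3])

def vif(X, i):
    """VIF of column i: regress it on the other columns (intercept included)."""
    y = X[:, i]
    others = np.delete(X, i, axis=1)
    A = np.column_stack([np.ones(len(y)), others])  # add intercept
    resid = y - A @ np.linalg.lstsq(A, y, rcond=None)[0]
    r2 = 1.0 - resid.var() / y.var()
    return 1.0 / (1.0 - r2)

vifs = [vif(X, i) for i in range(X.shape[1])]
# x1 and x2 receive high VIFs (serious multicollinearity),
# while x3 stays close to 1 (acceptable).
```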

Function version of the highly correlated feature removal process:¶

In [1172]:
def drop_highly_correlated_features(data, threshold=0.95):
    """
    Drops one of each pair of features with absolute correlation higher than the threshold.

    Parameters:
    - data (pd.DataFrame): Input DataFrame with numerical features.
    - threshold (float): Correlation threshold for dropping features.

    Returns:
    - pd.DataFrame: DataFrame with reduced features.
    - list: List of dropped features.
    """
    # Compute absolute correlation matrix
    corr_matrix = data.corr().abs()

    # Take the upper triangle of the correlation matrix
    upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))

    # Find columns with any correlation higher than the threshold
    to_drop = [col for col in upper.columns if any(upper[col] > threshold)]

    # Drop those columns
    reduced_data = data.drop(columns=to_drop)

    return reduced_data, to_drop
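A quick sanity check of this dropping logic on synthetic data (the function body is inlined so the example is self-contained; the frame and column names are made up):

```python
import numpy as np
import pandas as pd

# 'a_copy' is a near-duplicate of 'a'; 'b' is independent.
rng = np.random.default_rng(42)
a = rng.normal(size=500)
demo = pd.DataFrame({
    'a': a,
    'a_copy': a + rng.normal(scale=0.01, size=500),
    'b': rng.normal(size=500),
})

# Same logic as drop_highly_correlated_features, inlined for the demo:
corr = demo.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape), k=1).astype(bool))
dropped = [col for col in upper.columns if (upper[col] > 0.95).any()]
reduced = demo.drop(columns=dropped)
# Only 'a_copy' exceeds the 0.95 threshold (against 'a'); 'a' and 'b' survive.
```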

1. Loan Information¶

In [1173]:
# List of the Loan Information features
num_features = ['loan_amnt', 'funded_amnt', 'funded_amnt_inv', 'term_int', 'int_rate', 'installment']

# Calculate and print Variance Inflation Factor (VIF)
print(calculate_vif(df_inputs_prepr, num_features))
           feature           VIF
0      funded_amnt  10171.338802
1  funded_amnt_inv   5723.614303
2        loan_amnt   3901.089311
3      installment     68.475463
4         term_int     21.619893
5         int_rate     15.790025

Recommendations¶

1. Drop highly collinear variables:

Keep only one among:

  • loan_amnt ✅ (commonly the most interpretable)
  • funded_amnt ❌
  • funded_amnt_inv ❌

Also consider dropping:

  • installment ❌ (very collinear, can be recreated from loan amount + interest rate + term)

2. Keep these features:

  • int_rate ✅ (Key driver of affordability and default)
  • term_int ✅ (Can be binned into short vs long-term if needed)
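The note that installment is reconstructible from the other loan terms can be verified with the standard annuity formula (a sketch; the check uses the first row of the df_inputs_prepr.head() output shown earlier):

```python
def monthly_installment(loan_amnt, int_rate, term_months):
    """Standard annuity payment; int_rate is the annual rate in percent."""
    r = int_rate / 100.0 / 12.0  # monthly interest rate
    return loan_amnt * r * (1 + r) ** term_months / ((1 + r) ** term_months - 1)

# The first row of df_inputs_prepr.head() above (12000.0 at 13.99% over
# 60 months) lists an installment of 279.16; the formula reproduces it.
approx = monthly_installment(12000.0, 13.99, 60)
```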

VIF analysis of the new set of features.¶

In [1174]:
# List of the Loan Information features
num_features = ['loan_amnt', 'term_int', 'int_rate']

# Calculate and print Variance Inflation Factor (VIF)
print(calculate_vif(df_inputs_prepr, num_features))
     feature        VIF
0   term_int  13.595521
1   int_rate  10.039828
2  loan_amnt   4.353243

Multicollinearity among the kept variables is substantially reduced: loan_amnt is now in the acceptable range, while int_rate and term_int remain borderline-to-moderate.

2. Applicant Income & Employment¶

In [1175]:
# List of the Applicant Financial Profile features
num_features = ['annual_inc', 'emp_length_int', 'dti']

# Calculate and print Variance Inflation Factor (VIF)
print(calculate_vif(df_inputs_prepr, num_features))
          feature       VIF
0  emp_length_int  2.135642
1             dti  1.909319
2      annual_inc  1.369659

Recommendation¶

These features are generally informative and complementary, so we do not need to drop any of them.

3. Credit History and Delinquency¶

In [1176]:
# List of the Credit History and Delinquency features
num_features = ['delinq_2yrs', 'mths_since_last_delinq', 'mths_since_last_record', 'collections_12_mths_ex_med',
          'mths_since_last_major_derog', 'chargeoff_within_12_mths', 'delinq_amnt', 'acc_now_delinq', 'num_accts_ever_120_pd',
          'num_tl_90g_dpd_24m', 'num_tl_120dpd_2m', 'num_tl_30dpd']

# Calculate and print Variance Inflation Factor (VIF)
print(calculate_vif(df_inputs_prepr, num_features))
                        feature       VIF
0                acc_now_delinq  6.438133
1   mths_since_last_major_derog  6.050945
2                  num_tl_30dpd  5.342100
3        mths_since_last_record  4.121314
4        mths_since_last_delinq  3.568357
5              num_tl_120dpd_2m  2.516389
6                   delinq_2yrs  2.242248
7            num_tl_90g_dpd_24m  2.021835
8         num_accts_ever_120_pd  1.450038
9                   delinq_amnt  1.414542
10     chargeoff_within_12_mths  1.059052
11   collections_12_mths_ex_med  1.020881

Recommendations¶

1. Keep As-Is:

These features have low VIF, likely offer independent information, and are good predictors of risk:

  • delinq_2yrs
  • num_tl_90g_dpd_24m
  • num_accts_ever_120_pd
  • num_tl_120dpd_2m
  • chargeoff_within_12_mths
  • delinq_amnt
  • collections_12_mths_ex_med

2. Watch for Moderate Multicollinearity:

These are "time since" delinquency features and may overlap:

  • mths_since_last_major_derog
  • mths_since_last_record
  • mths_since_last_delinq
  • acc_now_delinq

We can keep one or two of the most informative time-based ones.

Combine Features:

In [1177]:
# If multiple time-based features overlap, consider taking the minimum or latest date:
# Taking the minimum of 'mths_since_last_delinq' and 'mths_since_last_major_derog'
df_inputs_prepr['min_mths_since_delinquency'] = df_inputs_prepr[['mths_since_last_delinq', 'mths_since_last_major_derog']].min(axis=1)
In [1178]:
# 'acc_now_delinq' shows if the borrower is currently delinquent on any account.
# 'mths_since_last_record' indicates how recently a public derogatory record (e.g., bankruptcy, judgment) was filed.
# Recommended Approach: Feature Combination with Risk Buckets

# Create a flag for current delinquency:
df_inputs_prepr['has_delinquency_now'] = (df_inputs_prepr['acc_now_delinq'] > 0).astype(int)

# Bucketize mths_since_last_record
df_inputs_prepr['last_record_bucket'] = pd.cut(df_inputs_prepr['mths_since_last_record'], bins=[-1, 12, 24, 60, np.inf], 
                                               labels=['<1yr', '1-2yr', '2-5yr', '5+yr'])

#  Combine into a single categorical feature
df_inputs_prepr['delinq_record_combo'] = df_inputs_prepr['has_delinquency_now'].astype(str) + '_' + df_inputs_prepr['last_record_bucket'].astype(str)
In [1179]:
# Define risk levels manually to transform the 'delinq_record_combo' feature into a numerical score
risk_map = {
    '1_<1yr': 7,
    '1_1-2yr': 6,
    '1_2-5yr': 5,
    '1_5+yr': 4,
    '0_<1yr': 3,
    '0_1-2yr': 2,
    '0_2-5yr': 1,
    '0_5+yr': 0
}

df_inputs_prepr['delinq_record_risk_score'] = df_inputs_prepr['delinq_record_combo'].map(risk_map)

VIF analysis of the new set of features.¶

In [1180]:
# List of the engineered and retained delinquency features
num_features = ['min_mths_since_delinquency', 'delinq_record_risk_score', 'delinq_2yrs', 'collections_12_mths_ex_med',
                'chargeoff_within_12_mths', 'delinq_amnt', 'num_accts_ever_120_pd',
                'num_tl_90g_dpd_24m', 'num_tl_120dpd_2m', 'num_tl_30dpd']

# Calculate and print Variance Inflation Factor (VIF)
print(calculate_vif(df_inputs_prepr, num_features))
                      feature       VIF
0          num_tl_90g_dpd_24m  1.846076
1                 delinq_2yrs  1.813648
2    delinq_record_risk_score  1.583014
3            num_tl_120dpd_2m  1.526392
4                 delinq_amnt  1.420302
5                num_tl_30dpd  1.411802
6       num_accts_ever_120_pd  1.166534
7    chargeoff_within_12_mths  1.056240
8  collections_12_mths_ex_med  1.016597
9  min_mths_since_delinquency  1.012894

The VIF values for all remaining delinquency-related features are well below the common multicollinearity threshold of 5, indicating that multicollinearity is no longer a concern in this subset. This suggests that each feature contributes uniquely to the model and provides distinct information about the borrower's credit risk. Therefore, no further feature removal is needed based on multicollinearity, and this set can be retained for modeling.

4. Credit Utilization & Balance¶

In [1181]:
# List of the Credit Utilization & Balance features
num_features = ['revol_bal', 'total_bal_il', 'max_bal_bc', 'avg_cur_bal', 'revol_util', 'il_util',
                'all_util', 'bc_util','bc_open_to_buy', 'total_bal_ex_mort', 'total_bc_limit']

# Calculate and print Variance Inflation Factor (VIF)
print(calculate_vif(df_inputs_prepr, num_features))
              feature        VIF
0            all_util  36.721871
1             il_util  28.903446
2             bc_util  19.463804
3          revol_util  19.446893
4      total_bc_limit  14.343601
5      bc_open_to_buy   9.326915
6   total_bal_ex_mort   5.450999
7           revol_bal   3.475286
8        total_bal_il   2.768232
9          max_bal_bc   2.050169
10        avg_cur_bal   1.905737

Recommendations:¶

We have a set of credit limit and balance-related features that are highly collinear. These are very common in credit scoring datasets and often show extreme multicollinearity because they represent variations of the same financial behavior.

Multicollinearity Issue: VIF > 10 signals serious multicollinearity. We cannot include all of these features directly in a model; they would distort the coefficients and inflate standard errors.

1. Total Credit Utilization (Core Features)

These can be engineered into ratios, which provide more meaningful insights.

In [1183]:
# Utilization Ratios
df_inputs_prepr['revol_bal_to_bc_limit'] = df_inputs_prepr['revol_bal'] / df_inputs_prepr['total_bc_limit'].replace(0, np.nan)
df_inputs_prepr['revol_bal_to_open_to_buy'] = df_inputs_prepr['revol_bal'] / df_inputs_prepr['bc_open_to_buy'].replace(0, np.nan)

# Balance to Income (if annual_inc is available):
df_inputs_prepr['total_bal_ex_mort_to_inc'] = df_inputs_prepr['total_bal_ex_mort'] / df_inputs_prepr['annual_inc'].replace(0, np.nan)

2. Drop Redundant Variables

Choose only one or two from each highly correlated group:

  • From all_util, il_util, revol_util, bc_util: We keep revol_util (widely used in credit models) and we drop the rest.
  • From total_bc_limit, bc_open_to_buy: We keep bc_open_to_buy (more dynamic).

We then drop the raw features that were folded into these engineered ratios, keeping only the ratio versions.
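That drop step could be sketched as follows (a toy stand-in frame is used so the snippet is self-contained, and the exact drop list is an assumption derived from the recommendations above):

```python
import pandas as pd

# Toy stand-in for df_inputs_prepr holding only the relevant column names.
df_inputs_prepr = pd.DataFrame(columns=['revol_util', 'bc_open_to_buy', 'all_util',
                                        'il_util', 'bc_util', 'total_bc_limit',
                                        'total_bal_ex_mort', 'revol_bal'])

# Drop the redundant utilization/limit columns per the recommendations;
# errors='ignore' guards against columns that were already removed.
cols_to_drop = ['all_util', 'il_util', 'bc_util', 'total_bc_limit', 'total_bal_ex_mort']
df_inputs_prepr = df_inputs_prepr.drop(columns=cols_to_drop, errors='ignore')
```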

VIF analysis of the new set of features.¶

In [1184]:
# List of the reduced Credit Utilization & Balance features
num_features = ['revol_bal', 'total_bal_il', 'max_bal_bc', 'avg_cur_bal', 'bc_open_to_buy', 'revol_bal_to_bc_limit', 
                'revol_bal_to_open_to_buy', 'total_bal_ex_mort_to_inc']

# Calculate and print Variance Inflation Factor (VIF)
print(calculate_vif(df_inputs_prepr, num_features))
                    feature       VIF
0                 revol_bal  2.174330
1  total_bal_ex_mort_to_inc  2.077000
2              total_bal_il  1.819026
3                max_bal_bc  1.768437
4               avg_cur_bal  1.660353
5            bc_open_to_buy  1.414334
6     revol_bal_to_bc_limit  1.245156
7  revol_bal_to_open_to_buy  1.031061

The VIF values for all remaining utilization- and balance-related features are well below the common multicollinearity threshold of 5, indicating that multicollinearity is no longer a concern in this subset. Therefore, no further feature removal is needed based on multicollinearity, and this set can be retained for modeling.

5. Credit Limits¶

In [1185]:
# List of the Credit Limits features.
num_features = ['tot_hi_cred_lim', 'total_rev_hi_lim', 'total_il_high_credit_limit', 'tot_cur_bal']

# Calculate and print Variance Inflation Factor (VIF).
print(calculate_vif(df_inputs_prepr, num_features))
                      feature        VIF
0             tot_hi_cred_lim  86.048912
1                 tot_cur_bal  70.228960
2            total_rev_hi_lim   3.281832
3  total_il_high_credit_limit   1.988739

Recommendations¶

1. Combine 'tot_hi_cred_lim' and 'tot_cur_bal' into a credit utilization-like ratio:

Captures how much of the credit limit is being used overall.

In [1186]:
df_inputs_prepr['total_balance_to_credit_ratio'] = df_inputs_prepr['tot_cur_bal'] / df_inputs_prepr['tot_hi_cred_lim'].replace(0, np.nan)

2. Create 'rev_to_il_limit_ratio' to measure installment vs. revolving exposure:

Gives insight into the borrower’s credit type distribution (revolving vs installment).

In [1187]:
df_inputs_prepr['rev_to_il_limit_ratio'] = df_inputs_prepr['total_rev_hi_lim'] / df_inputs_prepr['total_il_high_credit_limit'].replace(0, np.nan)

3. Keep the feature 'total_rev_hi_lim' as is: The borrower’s available revolving credit limit — important for understanding credit card capacity.

VIF analysis of the new set of features.¶

In [1188]:
# List of the engineered Credit Limits features.
num_features = ['total_balance_to_credit_ratio', 'rev_to_il_limit_ratio', 'total_il_high_credit_limit', 'tot_cur_bal']

# Calculate and print Variance Inflation Factor (VIF).
print(calculate_vif(df_inputs_prepr, num_features))
                         feature       VIF
0  total_balance_to_credit_ratio  3.732316
1     total_il_high_credit_limit  3.040626
2                    tot_cur_bal  2.632886
3          rev_to_il_limit_ratio  1.321904

The VIF values for all remaining engineered features are well below the threshold of 5, indicating low multicollinearity and suggesting that each variable provides unique and valuable information for modeling. This confirms the effectiveness of the feature engineering process in reducing redundancy while preserving predictive power.

6. Account and Credit Line Counts¶

In [1189]:
# List of the Account and Credit Line Count features
num_features = ['open_acc', 'total_acc', 'open_acc_6m', 'open_act_il', 'open_il_12m', 'open_il_24m', 'open_rv_12m', 
                'open_rv_24m', 'num_sats', 'num_il_tl', 'num_rev_accts', 'num_actv_bc_tl', 'num_actv_rev_tl', 'num_bc_sats', 
                'num_bc_tl', 'num_op_rev_tl', 'num_rev_tl_bal_gt_0', 'acc_open_past_24mths', 'total_cu_tl']

# Calculate and print Variance Inflation Factor (VIF)
print(calculate_vif(df_inputs_prepr, num_features))
                 feature         VIF
0    num_rev_tl_bal_gt_0  135.413933
1        num_actv_rev_tl  125.544110
2               num_sats  121.894356
3               open_acc  121.812303
4              total_acc   81.870765
5          num_rev_accts   75.563319
6          num_op_rev_tl   52.440646
7              num_bc_tl   29.505863
8         num_actv_bc_tl   26.112984
9            num_bc_sats   24.070302
10             num_il_tl   15.243944
11  acc_open_past_24mths    6.171402
12           open_rv_12m    6.050756
13           open_rv_24m    5.908277
14           open_il_24m    5.383007
15           open_il_12m    4.179422
16           open_acc_6m    3.607243
17           open_act_il    3.128423
18           total_cu_tl    1.451131

The VIF results show very high multicollinearity among many account number/count features, especially for revolving, bankcard, and installment accounts.

1. Recommended Features to KEEP (low redundancy, strong representation):¶

  • open_acc: Broad indicator of currently open accounts; relatively interpretable.
  • num_rev_tl_bal_gt_0: Reflects active revolving accounts with a balance — important for risk.
  • num_il_tl: Captures installment loan history; low VIF and unique info.
  • acc_open_past_24mths: Proxy for recent credit-seeking behavior; moderately correlated.
  • total_cu_tl: Specific to credit union trades; unique dimension of credit profile.

2. Recommended Features to DROP (high VIF or redundant info):¶

  • total_acc: Highly collinear with open_acc and num_sats.
  • open_acc_6m: Overlaps with acc_open_past_24mths and others capturing recent openings.
  • open_act_il: Correlates with num_il_tl, brings little new info.
  • open_il_12m, open_il_24m: Temporal splits of installment openings — redundant with num_il_tl.
  • open_rv_12m, open_rv_24m: Same issue with revolving trades, overlaps with num_rev_tl_bal_gt_0.
  • num_sats: Nearly identical meaning to open_acc.
  • num_rev_accts: Overlaps heavily with num_op_rev_tl and num_actv_rev_tl.
  • num_actv_bc_tl, num_actv_rev_tl, num_bc_sats, num_bc_tl: Redundant with num_rev_tl_bal_gt_0.
  • num_op_rev_tl: Captured by broader num_rev_tl_bal_gt_0.

VIF analysis of the new set of features.¶

In [1190]:
# List of the reduced Account and Credit Line Count features
num_features = ['total_acc', 'open_act_il', 'open_il_12m',  'num_actv_rev_tl', 'open_rv_12m', 'num_bc_tl',
                'open_acc_6m', 'acc_open_past_24mths', 'total_cu_tl']

# Calculate and print Variance Inflation Factor (VIF)
print(calculate_vif(df_inputs_prepr, num_features))
                feature       VIF
0             total_acc  7.873322
1             num_bc_tl  6.650931
2       num_actv_rev_tl  4.946451
3  acc_open_past_24mths  4.740237
4           open_acc_6m  3.602185
5           open_rv_12m  3.108141
6           open_il_12m  2.296064
7           open_act_il  1.777657
8           total_cu_tl  1.331291

Most VIF values are now in the acceptable-to-moderate range: total_acc (7.9) and num_bc_tl (6.7) remain moderately collinear, while the remaining features fall below the threshold of 5. Multicollinearity is substantially reduced, and this set can be carried forward, with the two moderate cases monitored during modeling.

7. Credit Inquiries¶

In [1191]:
# List of the Credit Inquiries features
num_features =  ['inq_fi', 'inq_last_6mths', 'inq_last_12m', 'mths_since_recent_inq']

# Calculate and print Variance Inflation Factor (VIF)
print(calculate_vif(df_inputs_prepr, num_features))
                 feature       VIF
0           inq_last_12m  2.346682
1                 inq_fi  2.126177
2         inq_last_6mths  1.209716
3  mths_since_recent_inq  1.006674

1. Keep All Features¶

Each feature captures slightly different information:

  • Frequency (how many inquiries: inq_last_12m, inq_last_6mths, inq_fi)
  • Recency (how recent: mths_since_recent_inq)

Credit risk models care both about how many inquiries you had (volume) and how recently (recency).

8. Payment and Recovery¶

In [1192]:
# List of the features you want to check
num_features = ['out_prncp', 'out_prncp_inv', 'total_pymnt', 'total_pymnt_inv', 'total_rec_prncp', 'total_rec_int', 
                'total_rec_late_fee', 'recoveries', 'collection_recovery_fee', 'last_pymnt_amnt']

# Calculate and print Variance Inflation Factor (VIF)
print(calculate_vif(df_inputs_prepr, num_features))
                   feature           VIF
0              total_pymnt  6.526956e+13
1          total_rec_prncp  4.503600e+13
2            total_rec_int  2.609270e+12
3               recoveries  1.796159e+11
4       total_rec_late_fee  2.310876e+07
5                out_prncp  1.035721e+06
6            out_prncp_inv  1.035720e+06
7          total_pymnt_inv  5.605819e+03
8  collection_recovery_fee  1.083006e+01
9          last_pymnt_amnt  3.216300e+00

There is massive multicollinearity. These features are all highly dependent on each other, because they are accounting/cash flow features related to loan repayment.

1. Drop Highly Redundant Features¶

Keep only 1 or 2 summary features instead of everything.

  • Suggested to KEEP:
    • last_pymnt_amnt ➔ amount of last payment (dynamic signal).
    • out_prncp ➔ current outstanding principal balance.
  • Suggested to DROP or not prioritize:
    • total_pymnt, total_rec_prncp, total_rec_int, recoveries, total_rec_late_fee, collection_recovery_fee, total_pymnt_inv, out_prncp_inv ➔ all highly redundant and overlapping.

2. Create Aggregated Features (Optional)¶

We can keep more information without multicollinearity:

  • Principal paid ratio: Proportion of principal repaid (good for default modeling).
In [1193]:
df_inputs_prepr['principal_paid_ratio'] = df_inputs_prepr['total_rec_prncp'] / df_inputs_prepr['loan_amnt']

VIF analysis of the new set of features.¶

In [1194]:
# List of the features you want to check
num_features = ['out_prncp', 'last_pymnt_amnt', 'principal_paid_ratio']

# Calculate and print Variance Inflation Factor (VIF)
print(calculate_vif(df_inputs_prepr, num_features))
                feature       VIF
0  principal_paid_ratio  1.739074
1       last_pymnt_amnt  1.737794
2             out_prncp  1.000991

The VIF values for all remaining engineered features are well below the threshold of 5, indicating low multicollinearity and suggesting that each variable provides unique and valuable information for modeling. This confirms the effectiveness of the feature engineering process in reducing redundancy while preserving predictive power.

9. FICO Scores¶

In [1195]:
# List of the features you want to check
num_features =  ['fico_range_low', 'fico_range_high', 'last_fico_range_high', 'last_fico_range_low']

# Calculate and print Variance Inflation Factor (VIF)
print(calculate_vif(df_inputs_prepr, num_features))
                feature           VIF
0       fico_range_high  1.505941e+07
1        fico_range_low  1.505512e+07
2  last_fico_range_high  2.315564e+02
3   last_fico_range_low  7.445950e+01
  • Including both low and high versions of FICO scores causes massive redundancy.
  • The model would have unstable coefficients, difficulty interpreting feature importance, and inflated errors.

1. Drop One of Each Pair¶

We do not need both low and high scores — they are highly correlated.

  • Keep only one from the original FICO range (fico_range_high or fico_range_low).
  • Keep only one from the last FICO range (last_fico_range_high or last_fico_range_low).
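The redundancy is easy to see in a toy check (the 4-point band width is an assumption that matches how Lending Club typically reports FICO ranges): the low and high ends of a band move in lockstep, so keeping both adds no information.

```python
import pandas as pd

# Toy illustration: the high end of each FICO band is the low end plus a
# fixed offset (assumed 4 points here), so the pair is perfectly correlated.
toy = pd.DataFrame({'fico_range_low': [660, 695, 720, 745]})
toy['fico_range_high'] = toy['fico_range_low'] + 4
corr = toy['fico_range_low'].corr(toy['fico_range_high'])
# A correlation of exactly 1.0 means one member of the pair is fully redundant.
```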

10. Credit Line & History Timelines¶

In [1196]:
# List of the features you want to check
num_features =  ['mths_since_earliest_cr_line', 'mths_since_issue_d', 'mo_sin_old_il_acct', 'mo_sin_old_rev_tl_op', 'mo_sin_rcnt_rev_tl_op',
                 'mo_sin_rcnt_tl', 'mths_since_rcnt_il', 'mths_since_recent_bc', 'mths_since_recent_bc_dlq', 'mths_since_recent_revol_delinq']

# Calculate and print Variance Inflation Factor (VIF)
print(calculate_vif(df_inputs_prepr, num_features))
                          feature        VIF
0     mths_since_earliest_cr_line  52.837152
1            mo_sin_old_rev_tl_op  25.293201
2              mths_since_issue_d  18.405433
3        mths_since_recent_bc_dlq  11.219079
4  mths_since_recent_revol_delinq   8.458308
5              mo_sin_old_il_acct   7.728318
6              mths_since_rcnt_il   5.044650
7           mo_sin_rcnt_rev_tl_op   3.561668
8                  mo_sin_rcnt_tl   2.924126
9            mths_since_recent_bc   2.582552

Main Observations:¶

  • The top four (mths_since_earliest_cr_line, mo_sin_old_rev_tl_op, mths_since_issue_d, mths_since_recent_bc_dlq) have high VIFs (>10) → strong multicollinearity.
  • mths_since_recent_revol_delinq and mo_sin_old_il_acct are elevated (VIF ≈ 7.7–8.5), while the remaining four show only moderate multicollinearity (VIF ≈ 2.6–5.0).

Why?

  • Many of these are time features measuring similar things: age of accounts, recency of new accounts, recency of delinquencies, etc.
  • Naturally, older credit history → older revolving accounts, older installment accounts, etc.
  • Loan issue date is also highly tied to the borrower's credit age profile.
  • Redundancy between "oldest" and "most recent" time features.
  • Models can become unstable and overfit due to high correlation between these time measures.

Recommendations for Feature Engineering:¶

1. Drop Some Redundant "Oldest" Features

  • These three features are very correlated: mths_since_earliest_cr_line, mo_sin_old_rev_tl_op, mo_sin_old_il_acct
  • Suggestion:
    • Keep only mths_since_earliest_cr_line (captures overall credit age).
    • Drop mo_sin_old_rev_tl_op and mo_sin_old_il_acct.

2. Handle Loan Issue Date Carefully

  • 'mths_since_issue_d': reflects the age of the loan. It might be useful, but it's strongly collinear with other "months since" features.
  • Suggestion: drop it to reduce multicollinearity.

3. Keep Recency of Recent Activity

These are different kinds of recency indicators:

  • mths_since_recent_bc_dlq
  • mths_since_recent_revol_delinq
  • mths_since_rcnt_il
  • mo_sin_rcnt_rev_tl_op
  • mo_sin_rcnt_tl
  • mths_since_recent_bc

Suggestion:

  • Keep most of these for now — they capture recent delinquency, recent account opening, and recency of activity, all of which are important for credit risk.
  • Later you can cluster or combine them if needed (e.g., minimum of all recent months as a "most recent event" feature).
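The optional combination suggested above can be sketched as follows: take the row-wise minimum across several "months since ..." columns to obtain a single "months since the most recent credit event" feature (toy values; this step is not applied in the notebook).

```python
import pandas as pd

# Hypothetical combined recency feature: the smallest "months since" value
# across several recency columns marks the MOST RECENT credit event.
recency_cols = ['mths_since_recent_bc_dlq', 'mths_since_recent_revol_delinq', 'mo_sin_rcnt_tl']
toy = pd.DataFrame({
    'mths_since_recent_bc_dlq':       [12, 30],
    'mths_since_recent_revol_delinq': [24,  8],
    'mo_sin_rcnt_tl':                 [ 3, 15],
})
toy['mths_since_most_recent_event'] = toy[recency_cols].min(axis=1)
```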

VIF analysis of the new set of features.¶

In [1197]:
# List of the features you want to check
num_features = ['mths_since_earliest_cr_line', 'mo_sin_rcnt_rev_tl_op', 'mo_sin_rcnt_tl', 'mths_since_rcnt_il', 
                'mths_since_recent_bc', 'mths_since_recent_revol_delinq']

# Calculate and print Variance Inflation Factor (VIF)
print(calculate_vif(df_inputs_prepr, num_features))
                          feature       VIF
0     mths_since_earliest_cr_line  4.032080
1           mo_sin_rcnt_rev_tl_op  3.555260
2                  mo_sin_rcnt_tl  2.902862
3              mths_since_rcnt_il  2.572720
4            mths_since_recent_bc  2.571783
5  mths_since_recent_revol_delinq  2.474972

All VIF values in this reduced timeline set are below 5, indicating that the remaining features contribute largely independent information while still covering overall credit age and the recency of delinquencies and account openings.

11. Other / Miscellaneous¶

In [1198]:
# List of the features you want to check
num_features =  ['percent_bc_gt_75', 'pct_tl_nvr_dlq', 'tax_liens', 'pub_rec', 'pub_rec_bankruptcies', 'tot_coll_amt',
                 'mort_acc', 'months_since_last_pymnt', 'months_since_last_credit_pull', 'policy_code']

# Calculate and print Variance Inflation Factor (VIF)
print(calculate_vif(df_inputs_prepr, num_features))
                         feature         VIF
0                    policy_code  127.367693
1                        pub_rec    7.787948
2           pub_rec_bankruptcies    4.551261
3                      tax_liens    3.970360
4        months_since_last_pymnt    1.463992
5  months_since_last_credit_pull    1.453885
6                 pct_tl_nvr_dlq    1.026930
7                   tot_coll_amt    1.008896
8                       mort_acc    1.005815
9               percent_bc_gt_75    1.004729

1. Keep Timing and Ratio Variables As-Is¶

pub_rec_bankruptcies, months_since_last_pymnt, months_since_last_credit_pull, pct_tl_nvr_dlq, tot_coll_amt, percent_bc_gt_75 and mort_acc all show low collinearity and can be kept as they are.

2. Assess Relationship Between Public Record Variables¶

  • pub_rec, pub_rec_bankruptcies, tax_liens are all related to legal financial problems.
  • Suggestion:
    • Keep pub_rec_bankruptcies separately — bankruptcies have a big credit impact.
    • Consider combining pub_rec and tax_liens into a single indicator and then dropping them individually.
In [1199]:
# Combining 'pub_rec' and 'tax_liens' 
df_inputs_prepr['total_public_records'] = df_inputs_prepr['pub_rec'] + df_inputs_prepr['tax_liens']

3. Drop policy_code¶

policy_code takes a single value (1) throughout the Lending Club dataset, so it carries no predictive information ➔ drop it immediately.
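A generic guard for such near-constant columns (a sketch, not code from the notebook) is to test the number of unique values:

```python
import pandas as pd

def constant_columns(df):
    # Columns with at most one unique non-null value carry no signal.
    return [c for c in df.columns if df[c].nunique(dropna=True) <= 1]

# Toy check with hypothetical values: policy_code is constant, pub_rec is not.
toy = pd.DataFrame({'policy_code': [1, 1, 1], 'pub_rec': [0, 1, 0]})
```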

VIF analysis of the new set of features.¶

In [1200]:
# List of the features you want to check
num_features =  ['percent_bc_gt_75', 'pub_rec_bankruptcies', 'tot_coll_amt', 'mort_acc', 
                 'months_since_last_credit_pull', 'total_public_records']

# Calculate and print Variance Inflation Factor (VIF)
print(calculate_vif(df_inputs_prepr, num_features))
                         feature       VIF
0  months_since_last_credit_pull  2.443156
1               percent_bc_gt_75  2.144920
2                       mort_acc  1.512708
3           pub_rec_bankruptcies  1.476033
4           total_public_records  1.444479
5                   tot_coll_amt  1.010763

With policy_code dropped and pub_rec and tax_liens combined into total_public_records, every VIF is below 2.5, confirming that the redundancy in this group has been resolved.

12. Feature Reduction Based on Multicollinearity¶

To enhance model stability and reduce redundancy, a Variance Inflation Factor (VIF) analysis was conducted on all numerical features. High VIF values indicate multicollinearity, which can distort model coefficients and impair generalization. Based on this analysis, 46 numerical features exhibiting strong multicollinearity or redundancy were identified and excluded from further modeling, retaining only the most informative and independent predictors.

In [1201]:
# List of features to drop (duplicate entries removed).
feats_num_to_drop = ['funded_amnt', 'funded_amnt_inv', 'installment', 'mths_since_last_delinq', 'mths_since_last_record', 
                     'mths_since_last_major_derog', 'acc_now_delinq', 'has_delinquency_now', 'last_record_bucket', 
                     'delinq_record_combo', 'all_util', 'il_util', 'bc_util', 'revol_util', 'total_bc_limit',  
                     'total_bal_ex_mort', 'tot_hi_cred_lim', 'open_acc', 'open_il_24m', 'open_rv_24m',       
                     'num_sats', 'num_il_tl', 'num_rev_accts', 'num_actv_bc_tl', 'num_bc_sats', 
                     'num_op_rev_tl', 'num_rev_tl_bal_gt_0', 'out_prncp_inv', 'total_pymnt', 'total_pymnt_inv',      
                     'total_rec_prncp', 'total_rec_int', 'total_rec_late_fee', 'recoveries', 'collection_recovery_fee',
                     'fico_range_low', 'last_fico_range_low', 'mo_sin_old_rev_tl_op', 'mo_sin_old_il_acct', 
                     'mths_since_issue_d', 'mths_since_recent_bc_dlq', 'policy_code', 'pub_rec', 'tax_liens', 
                     'months_since_last_pymnt', 'pct_tl_nvr_dlq']
print(len(feats_num_to_drop))
46
In [1202]:
# Drop this set of features from the df_inputs_prepr dataframe.
df_inputs_prepr = df_inputs_prepr.drop(columns = feats_num_to_drop)

C. Engineering of Numerical Variables¶

Checking the list and the number of features after this preprocessing:¶

In [1203]:
List_num_features = ['loan_amnt', 'term_int', 'int_rate', 'annual_inc', 'emp_length_int', 'dti', 'min_mths_since_delinquency', 
                     'delinq_record_risk_score', 'delinq_2yrs', 'collections_12_mths_ex_med', 'chargeoff_within_12_mths', 
                     'delinq_amnt', 'num_accts_ever_120_pd', 'num_tl_90g_dpd_24m', 'num_tl_120dpd_2m', 'num_tl_30dpd', 
                     'revol_bal', 'total_bal_il', 'max_bal_bc', 'avg_cur_bal', 'bc_open_to_buy', 'revol_bal_to_bc_limit', 
                     'revol_bal_to_open_to_buy', 'total_bal_ex_mort_to_inc', 'total_balance_to_credit_ratio', 'rev_to_il_limit_ratio', 
                     'total_il_high_credit_limit', 'tot_cur_bal', 'total_acc', 'open_act_il', 'open_il_12m',  'num_actv_rev_tl', 
                     'open_rv_12m', 'num_bc_tl', 'open_acc_6m', 'acc_open_past_24mths', 'total_cu_tl', 'inq_fi', 'inq_last_6mths',
                     'inq_last_12m', 'mths_since_recent_inq', 'out_prncp', 'last_pymnt_amnt', 'principal_paid_ratio',
                     'fico_range_high', 'last_fico_range_high', 'mths_since_earliest_cr_line', 'mo_sin_rcnt_rev_tl_op', 'mo_sin_rcnt_tl', 
                     'mths_since_rcnt_il', 'mths_since_recent_bc', 'mths_since_recent_revol_delinq', 'percent_bc_gt_75', 
                     'pub_rec_bankruptcies', 'tot_coll_amt', 'mort_acc', 'months_since_last_credit_pull', 'total_public_records']

print('number of features after preprocessing: ', len(List_num_features))
number of features after preprocessing:  58

After preprocessing the numerical variables, we are left with 58 features, down from the original 94.

Classification of the numerical features into discrete or continuous:¶

Here's a "smart" automatic strategy for classifying features into discrete or continuous, not just based on n_unique, but combining:

  • Number of unique values.
  • Variance (dispersion).
  • Data type (integer vs float).

Smart Strategy Logic:¶

  • 1. If the feature is integer type:

    • If unique values ≤ 15 → treat as discrete.
    • Else → if variance is low → treat as discrete, otherwise continuous.
  • 2. If the feature is float type:

    • Always treat as continuous (except if very few unique values, like ≤ 5).
  • 3. If the feature has extremely low variance (almost constant), treat it as discrete.

In [1204]:
# Function to classify the features
def classify_feature(df, threshold_unique=15, threshold_variance=0.01):
    discrete_features = []
    continuous_features = []
    
    for col in df.columns:
        if np.issubdtype(df[col].dtype, np.number):  # only numeric features
            n_unique = df[col].nunique()
            variance = df[col].var()
            
            if np.issubdtype(df[col].dtype, np.integer):
                if n_unique <= threshold_unique or variance < threshold_variance:
                    discrete_features.append(col)
                else:
                    continuous_features.append(col)
            else:  # float
                if n_unique <= 5 or variance < threshold_variance:
                    discrete_features.append(col)
                else:
                    continuous_features.append(col)
    
    return discrete_features, continuous_features

Why is this better?¶

  • It adapts to your real data — not just the number of unique values blindly.
  • It respects the nature of "counts" vs "proportions" vs "scores".
  • It avoids misclassifying features that are numerically continuous in type but take only a few distinct values.

Classification of the numerical features into discrete or continuous¶

In [1205]:
# Assuming df_inputs_prepr is your preprocessed dataset
df_inputs_prepr_class = df_inputs_prepr[List_num_features].copy()
discr_features, conti_features = classify_feature(df_inputs_prepr_class)

print("Discrete features:", discr_features)
print()
print("Continuous features:", conti_features)
Discrete features: ['term_int', 'delinq_record_risk_score', 'num_tl_120dpd_2m', 'num_tl_30dpd']

Continuous features: ['loan_amnt', 'int_rate', 'annual_inc', 'emp_length_int', 'dti', 'min_mths_since_delinquency', 'delinq_2yrs', 'collections_12_mths_ex_med', 'chargeoff_within_12_mths', 'delinq_amnt', 'num_accts_ever_120_pd', 'num_tl_90g_dpd_24m', 'revol_bal', 'total_bal_il', 'max_bal_bc', 'avg_cur_bal', 'bc_open_to_buy', 'revol_bal_to_bc_limit', 'revol_bal_to_open_to_buy', 'total_bal_ex_mort_to_inc', 'total_balance_to_credit_ratio', 'rev_to_il_limit_ratio', 'total_il_high_credit_limit', 'tot_cur_bal', 'total_acc', 'open_act_il', 'open_il_12m', 'num_actv_rev_tl', 'open_rv_12m', 'num_bc_tl', 'open_acc_6m', 'acc_open_past_24mths', 'total_cu_tl', 'inq_fi', 'inq_last_6mths', 'inq_last_12m', 'mths_since_recent_inq', 'out_prncp', 'last_pymnt_amnt', 'principal_paid_ratio', 'fico_range_high', 'last_fico_range_high', 'mths_since_earliest_cr_line', 'mo_sin_rcnt_rev_tl_op', 'mo_sin_rcnt_tl', 'mths_since_rcnt_il', 'mths_since_recent_bc', 'mths_since_recent_revol_delinq', 'percent_bc_gt_75', 'pub_rec_bankruptcies', 'tot_coll_amt', 'mort_acc', 'months_since_last_credit_pull', 'total_public_records']

With this classification strategy we found:

  • 4 discrete features.
  • 54 continuous features.

WoE and IV classification of features:¶

  • We apply Weight of Evidence (WoE) transformation to numerical features to create a stronger, more interpretable relationship with the target.
  • Information Value (IV) helps prioritize the most predictive variables for credit risk modeling.
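For reference, the textbook definitions of WoE and IV on a toy two-bin example (the proportions are assumed values, not taken from the dataset):

```python
import numpy as np

# Textbook formulas on a toy two-bin split:
prop_n_good = np.array([0.60, 0.40])   # share of goods falling in each bin
prop_n_bad  = np.array([0.80, 0.20])   # share of bads falling in each bin

woe = np.log(prop_n_good / prop_n_bad)          # WoE_i = ln(good share / bad share)
iv  = np.sum((prop_n_good - prop_n_bad) * woe)  # IV = sum over all bins

# Note: the woe_ordered_continuous function below uses np.log1p, i.e.
# ln(1 + ratio), which shifts every WoE upward relative to the textbook
# formula but preserves the ordering of the bins.
```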

Function to evaluate WoE and IV of a continuous variable¶

In [1206]:
# WoE function for ordered discrete and continuous variables.
# It takes a dataframe, the name of the variable to group on, and a dataframe
# holding the good/bad target; it returns a dataframe with one row per
# category and the corresponding WoE and IV statistics.
def woe_ordered_continuous(df, variable_name, good_bad_variable_df):
    df = pd.concat([df[variable_name], good_bad_variable_df], axis = 1)
    # observed = False keeps empty categorical bins and silences the pandas FutureWarning.
    df = pd.concat([df.groupby(df.columns.values[0], as_index = False, observed = False)[df.columns.values[1]].count(),
                    df.groupby(df.columns.values[0], as_index = False, observed = False)[df.columns.values[1]].mean()], axis = 1)
    df = df.iloc[:, [0, 1, 3]]
    df.columns = [df.columns.values[0], 'n_obs', 'prop_good']
    df['prop_n_obs'] = df['n_obs'] / df['n_obs'].sum()
    df['n_good'] = df['prop_good'] * df['n_obs']
    df['n_bad'] = (1 - df['prop_good']) * df['n_obs']
    df['prop_n_good'] = df['n_good'] / df['n_good'].sum()
    df['prop_n_bad'] = df['n_bad'] / df['n_bad'].sum()
    # np.log1p gives ln(1 + good/bad ratio), a smoothed variant of the
    # classical WoE = ln(good share / bad share); it shifts all WoE values
    # upward but preserves the ordering of the categories.
    df['WoE'] = np.log1p(df['prop_n_good'] / df['prop_n_bad'])
    df['diff_prop_good'] = df['prop_good'].diff().abs()
    df['diff_WoE'] = df['WoE'].diff().abs()
    df['IV'] = (df['prop_n_good'] - df['prop_n_bad']) * df['WoE']
    df['IV'] = df['IV'].sum()
    return df

Variable: 'term_int'¶

In [1207]:
df_temp = woe_ordered_continuous(df_inputs_prepr, 'term_int', df_targets_prepr)
# We calculate weight of evidence.
df_temp
Out[1207]:
term_int n_obs prop_good prop_n_obs n_good n_bad prop_n_good prop_n_bad WoE diff_prop_good diff_WoE IV
0 36.0 206971 0.171783 0.754724 35554.0 171417.0 0.604639 0.79569 0.565253 NaN NaN 0.09772
1 60.0 67263 0.345628 0.245276 23248.0 44015.0 0.395361 0.20431 1.076741 0.173846 0.511488 0.09772
In [1208]:
plot_by_woe(df_temp)
# We plot the weight of evidence values.
[Figure: WoE plot for 'term_int']
In [1209]:
# Leave as is.
# '60' will be the reference category.
df_inputs_prepr['term:36'] = np.where((df_inputs_prepr['term_int'] == 36), 1, 0)
df_inputs_prepr['term:60'] = np.where((df_inputs_prepr['term_int'] == 60), 1, 0)
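For reading IV values like the ~0.098 obtained for 'term_int' above, the common rule-of-thumb bands (after Siddiqi; exact cutoffs vary by practitioner) can be encoded as:

```python
def iv_strength(iv):
    # Rule-of-thumb Information Value bands used in credit scorecard practice.
    if iv < 0.02:
        return 'not useful'
    elif iv < 0.10:
        return 'weak'
    elif iv < 0.30:
        return 'medium'
    elif iv < 0.50:
        return 'strong'
    else:
        return 'suspiciously strong'
```

Since this notebook's IV is computed from the log1p variant of WoE, these bands should be read as indicative rather than exact here.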

Variable: 'num_tl_120dpd_2m'¶

In [1210]:
# 'num_tl_120dpd_2m'
df_inputs_prepr['num_tl_120dpd_2m'].unique()
# Has only 3 levels in this sample: 0 to 2. We treat it as a discrete factor.
Out[1210]:
array([0., 1., 2.])
In [1211]:
df_temp = woe_ordered_continuous(df_inputs_prepr, 'num_tl_120dpd_2m', df_targets_prepr)
# We calculate weight of evidence.
df_temp
Out[1211]:
num_tl_120dpd_2m n_obs prop_good prop_n_obs n_good n_bad prop_n_good prop_n_bad WoE diff_prop_good diff_WoE IV
0 0.0 274036 0.214435 0.999278 58763.0 215273.0 0.999337 0.999262 0.693185 NaN NaN 0.000024
1 1.0 191 0.204188 0.000696 39.0 152.0 0.000663 0.000706 0.662701 0.010247 0.030484 0.000024
2 2.0 7 0.000000 0.000026 0.0 7.0 0.000000 0.000032 0.000000 0.204188 0.662701 0.000024
In [1212]:
plot_by_woe(df_temp)
# We plot the weight of evidence values.
[Figure: WoE plot for 'num_tl_120dpd_2m']
In [1213]:
# We create the following categories: '0', '1', '2 - 6'
# '2-6' will be the reference category
df_inputs_prepr['num_tl_120dpd_2m:0'] = np.where(df_inputs_prepr['num_tl_120dpd_2m'].isin([0]), 1, 0)
df_inputs_prepr['num_tl_120dpd_2m:1'] = np.where(df_inputs_prepr['num_tl_120dpd_2m'].isin([1]), 1, 0)
df_inputs_prepr['num_tl_120dpd_2m:2-6'] = np.where(df_inputs_prepr['num_tl_120dpd_2m'].isin(range(2, 7)), 1, 0)
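The repeated np.where(...isin(...)) pattern used for these dummies can be factored into a small helper (a sketch, not code from the notebook):

```python
import numpy as np
import pandas as pd

def bin_dummy(df, col, values, label):
    # 0/1 indicator marking rows whose `col` value falls in `values`.
    df[f'{col}:{label}'] = np.where(df[col].isin(values), 1, 0)

# Toy usage mirroring the cell above:
toy = pd.DataFrame({'num_tl_120dpd_2m': [0.0, 1.0, 2.0]})
bin_dummy(toy, 'num_tl_120dpd_2m', [0], '0')
bin_dummy(toy, 'num_tl_120dpd_2m', range(2, 7), '2-6')
```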

Variable: 'num_tl_30dpd'¶

In [1214]:
# 'num_tl_30dpd'
df_inputs_prepr['num_tl_30dpd'].unique()
# Has only 4 levels in this sample: 0 to 3. We treat it as a discrete factor.
Out[1214]:
array([0., 2., 1., 3.])
In [1215]:
df_temp = woe_ordered_continuous(df_inputs_prepr, 'num_tl_30dpd', df_targets_prepr)
# We calculate weight of evidence.
df_temp
Out[1215]:
num_tl_30dpd n_obs prop_good prop_n_obs n_good n_bad prop_n_good prop_n_bad WoE diff_prop_good diff_WoE IV
0 0.0 273429 0.214330 0.997065 58604.0 214825.0 0.996633 0.997182 0.692872 NaN NaN 0.000058
1 1.0 761 0.248357 0.002775 189.0 572.0 0.003214 0.002655 0.793243 0.034028 0.100371 0.000058
2 2.0 37 0.216216 0.000135 8.0 29.0 0.000136 0.000135 0.698469 0.032141 0.094774 0.000058
3 3.0 7 0.142857 0.000026 1.0 6.0 0.000017 0.000028 0.476616 0.073359 0.221853 0.000058
In [1216]:
plot_by_woe(df_temp)
# We plot the weight of evidence values.
[Figure: WoE plot for 'num_tl_30dpd']
In [1217]:
# We create the following categories: '0', '1', '2 - 4'
# '2 - 4' will be the reference category
df_inputs_prepr['num_tl_30dpd:0'] = np.where(df_inputs_prepr['num_tl_30dpd'].isin([0]), 1, 0)
df_inputs_prepr['num_tl_30dpd:1'] = np.where(df_inputs_prepr['num_tl_30dpd'].isin([1]), 1, 0)
df_inputs_prepr['num_tl_30dpd:2-4'] = np.where(df_inputs_prepr['num_tl_30dpd'].isin(range(2, 5)), 1, 0)

Variable: 'delinq_record_risk_score'¶

In [1218]:
# 'delinq_record_risk_score'
df_inputs_prepr['delinq_record_risk_score'].unique()
# Has 8 levels: 0 to 7. We treat it as a discrete factor.
Out[1218]:
array([0, 2, 1, 4, 3, 5, 7, 6], dtype=int64)
In [1219]:
df_temp = woe_ordered_continuous(df_inputs_prepr, 'delinq_record_risk_score', df_targets_prepr)
# We calculate weight of evidence.
df_temp
Out[1219]:
delinq_record_risk_score n_obs prop_good prop_n_obs n_good n_bad prop_n_good prop_n_bad WoE diff_prop_good diff_WoE IV
0 0 256999 0.213304 0.937152 54819.0 202180.0 0.932264 0.938486 0.689827 NaN NaN 0.00046
1 1 13401 0.230505 0.048867 3089.0 10312.0 0.052532 0.047867 0.740732 0.017201 0.050906 0.00046
2 2 1588 0.229849 0.005791 365.0 1223.0 0.006207 0.005677 0.738796 0.000656 0.001936 0.00046
3 3 987 0.245187 0.003599 242.0 745.0 0.004116 0.003458 0.783939 0.015339 0.045143 0.00046
4 4 1166 0.237564 0.004252 277.0 889.0 0.004711 0.004127 0.761531 0.007623 0.022408 0.00046
5 5 69 0.086957 0.000252 6.0 63.0 0.000102 0.000292 0.299306 0.150608 0.462225 0.00046
6 6 13 0.076923 0.000047 1.0 12.0 0.000017 0.000056 0.266438 0.010033 0.032868 0.00046
7 7 11 0.272727 0.000040 3.0 8.0 0.000051 0.000037 0.864527 0.195804 0.598088 0.00046
In [1220]:
plot_by_woe(df_temp)
# We plot the weight of evidence values.
[Figure: WoE plot for 'delinq_record_risk_score']
In [1221]:
# We create the following categories: '0', '1 - 2', '3 - 4', '5 - 7'
# '5-7' will be the reference category
df_inputs_prepr['delinq_record_risk_score:0'] = np.where(df_inputs_prepr['delinq_record_risk_score'].isin([0]), 1, 0)
df_inputs_prepr['delinq_record_risk_score:1-2'] = np.where(df_inputs_prepr['delinq_record_risk_score'].isin(range(1, 3)), 1, 0)
df_inputs_prepr['delinq_record_risk_score:3-4'] = np.where(df_inputs_prepr['delinq_record_risk_score'].isin(range(3, 5)), 1, 0)
df_inputs_prepr['delinq_record_risk_score:5-7'] = np.where(df_inputs_prepr['delinq_record_risk_score'].isin(range(5, 8)), 1, 0)

Variable: 'loan_amnt'¶

In [1222]:
# loan_amnt
df_inputs_prepr['loan_amnt_factor'] = pd.cut(df_inputs_prepr['loan_amnt'], 50)
# Here we do fine-classing: using the 'cut' method, we split the variable into 50 categories by its values.
df_temp = woe_ordered_continuous(df_inputs_prepr, 'loan_amnt_factor', df_targets_prepr)
# We calculate weight of evidence.
df_temp
Out[1222]:
loan_amnt_factor n_obs prop_good prop_n_obs n_good n_bad prop_n_good prop_n_bad WoE diff_prop_good diff_WoE IV
0 (460.5, 1290.0] 1742 0.128588 0.006352 224.0 1518.0 0.003809 0.007046 0.432187 NaN NaN 0.022676
1 (1290.0, 2080.0] 3896 0.154517 0.014207 602.0 3294.0 0.010238 0.015290 0.512562 0.025930 0.080375 0.022676
2 (2080.0, 2870.0] 3583 0.149595 0.013065 536.0 3047.0 0.009115 0.014144 0.497425 0.004922 0.015136 0.022676
3 (2870.0, 3660.0] 7864 0.173194 0.028676 1362.0 6502.0 0.023162 0.030181 0.569536 0.023599 0.072111 0.022676
4 (3660.0, 4450.0] 6291 0.175648 0.022940 1105.0 5186.0 0.018792 0.024073 0.576970 0.002453 0.007434 0.022676
5 (4450.0, 5240.0] 14267 0.177402 0.052025 2531.0 11736.0 0.043043 0.054477 0.582280 0.001755 0.005310 0.022676
6 (5240.0, 6030.0] 13689 0.165827 0.049917 2270.0 11419.0 0.038604 0.053005 0.547144 0.011576 0.035136 0.022676
7 (6030.0, 6820.0] 4791 0.173868 0.017470 833.0 3958.0 0.014166 0.018372 0.571577 0.008041 0.024434 0.022676
8 (6820.0, 7610.0] 10612 0.168300 0.038697 1786.0 8826.0 0.030373 0.040969 0.554673 0.005568 0.016905 0.022676
9 (7610.0, 8400.0] 13440 0.180283 0.049009 2423.0 11017.0 0.041206 0.051139 0.590984 0.011983 0.036311 0.022676
10 (8400.0, 9190.0] 6964 0.170017 0.025394 1184.0 5780.0 0.020135 0.026830 0.559893 0.010266 0.031091 0.022676
11 (9190.0, 9980.0] 5397 0.201223 0.019680 1086.0 4311.0 0.018469 0.020011 0.653851 0.031206 0.093958 0.022676
12 (9980.0, 10770.0] 23726 0.210613 0.086517 4997.0 18729.0 0.084980 0.086937 0.681829 0.009390 0.027978 0.022676
13 (10770.0, 11560.0] 6664 0.230642 0.024300 1537.0 5127.0 0.026139 0.023799 0.741137 0.020029 0.059308 0.022676
14 (11560.0, 12350.0] 16752 0.211736 0.061087 3547.0 13205.0 0.060321 0.061295 0.685167 0.018906 0.055969 0.022676
15 (12350.0, 13140.0] 5349 0.217798 0.019505 1165.0 4184.0 0.019812 0.019421 0.703158 0.006062 0.017991 0.022676
16 (13140.0, 13930.0] 2814 0.264748 0.010261 745.0 2069.0 0.012670 0.009604 0.841227 0.046950 0.138068 0.022676
17 (13930.0, 14720.0] 7764 0.231968 0.028312 1801.0 5963.0 0.030628 0.027679 0.745047 0.032780 0.096180 0.022676
18 (14720.0, 15510.0] 16292 0.221581 0.059409 3610.0 12682.0 0.061392 0.058868 0.714364 0.010387 0.030682 0.022676
19 (15510.0, 16300.0] 9284 0.243968 0.033854 2265.0 7019.0 0.038519 0.032581 0.780359 0.022387 0.065994 0.022676
20 (16300.0, 17090.0] 4042 0.239485 0.014739 968.0 3074.0 0.016462 0.014269 0.767183 0.004483 0.013175 0.022676
21 (17090.0, 17880.0] 2133 0.280825 0.007778 599.0 1534.0 0.010187 0.007121 0.888140 0.041340 0.120957 0.022676
22 (17880.0, 18670.0] 7859 0.236035 0.028658 1855.0 6004.0 0.031547 0.027870 0.757030 0.044790 0.131110 0.022676
23 (18670.0, 19460.0] 3120 0.267949 0.011377 836.0 2284.0 0.014217 0.010602 0.850578 0.031914 0.093548 0.022676
24 (19460.0, 20250.0] 16121 0.238323 0.058786 3842.0 12279.0 0.065338 0.056997 0.763763 0.029626 0.086815 0.022676
25 (20250.0, 21040.0] 4870 0.244148 0.017759 1189.0 3681.0 0.020220 0.017087 0.780887 0.005825 0.017124 0.022676
26 (21040.0, 21830.0] 1549 0.300194 0.005648 465.0 1084.0 0.007908 0.005032 0.944528 0.056046 0.163641 0.022676
27 (21830.0, 22620.0] 2759 0.246829 0.010061 681.0 2078.0 0.011581 0.009646 0.788757 0.053365 0.155771 0.022676
28 (22620.0, 23410.0] 1818 0.246975 0.006629 449.0 1369.0 0.007636 0.006355 0.789186 0.000146 0.000429 0.022676
29 (23410.0, 24200.0] 7816 0.236566 0.028501 1849.0 5967.0 0.031445 0.027698 0.758593 0.010409 0.030593 0.022676
30 (24200.0, 24990.0] 1128 0.275709 0.004113 311.0 817.0 0.005289 0.003792 0.873225 0.039143 0.114632 0.022676
31 (24990.0, 25780.0] 7941 0.227049 0.028957 1803.0 6138.0 0.030662 0.028492 0.730532 0.048660 0.142693 0.022676
32 (25780.0, 26570.0] 1337 0.257292 0.004875 344.0 993.0 0.005850 0.004609 0.819424 0.030243 0.088892 0.022676
33 (26570.0, 27360.0] 1162 0.274527 0.004237 319.0 843.0 0.005425 0.003913 0.869776 0.017234 0.050352 0.022676
34 (27360.0, 28150.0] 4685 0.210672 0.017084 987.0 3698.0 0.016785 0.017166 0.682006 0.063854 0.187770 0.022676
35 (28150.0, 28940.0] 689 0.256894 0.002512 177.0 512.0 0.003010 0.002377 0.818258 0.046222 0.136252 0.022676
36 (28940.0, 29730.0] 821 0.258222 0.002994 212.0 609.0 0.003605 0.002827 0.822143 0.001328 0.003886 0.022676
37 (29730.0, 30520.0] 6437 0.275128 0.023473 1771.0 4666.0 0.030118 0.021659 0.871531 0.016906 0.049387 0.022676
38 (30520.0, 31310.0] 644 0.296584 0.002348 191.0 453.0 0.003248 0.002103 0.934026 0.021456 0.062495 0.022676
39 (31310.0, 32100.0] 1569 0.284895 0.005721 447.0 1122.0 0.007602 0.005208 0.899997 0.011689 0.034028 0.022676
40 (32100.0, 32890.0] 476 0.319328 0.001736 152.0 324.0 0.002585 0.001504 1.000178 0.034433 0.100181 0.022676
41 (32890.0, 33680.0] 771 0.258106 0.002811 199.0 572.0 0.003384 0.002655 0.821806 0.061221 0.178372 0.022676
42 (33680.0, 34470.0] 382 0.324607 0.001393 124.0 258.0 0.002109 0.001198 1.015535 0.066501 0.193729 0.022676
43 (34470.0, 35260.0] 10860 0.262523 0.039601 2851.0 8009.0 0.048485 0.037176 0.834724 0.062084 0.180811 0.022676
44 (35260.0, 36050.0] 329 0.273556 0.001200 90.0 239.0 0.001531 0.001109 0.866945 0.011033 0.032221 0.022676
45 (36050.0, 36840.0] 35 0.257143 0.000128 9.0 26.0 0.000153 0.000121 0.818986 0.016413 0.047959 0.022676
46 (36840.0, 37630.0] 53 0.207547 0.000193 11.0 42.0 0.000187 0.000195 0.672708 0.049596 0.146278 0.022676
47 (37630.0, 38420.0] 72 0.250000 0.000263 18.0 54.0 0.000306 0.000251 0.798060 0.042453 0.125352 0.022676
48 (38420.0, 39210.0] 37 0.216216 0.000135 8.0 29.0 0.000136 0.000135 0.698469 0.033784 0.099591 0.022676
49 (39210.0, 40000.0] 1538 0.283485 0.005608 436.0 1102.0 0.007415 0.005115 0.895890 0.067269 0.197422 0.022676
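For reference, the `WoE` and `IV` columns in these tables rest on the standard weight-of-evidence and information-value definitions. A minimal, self-contained sketch of that computation on synthetic data (this is the textbook formula, not a reproduction of the notebook's `woe_ordered_continuous` helper, whose body appears earlier):

```python
import numpy as np
import pandas as pd

def woe_iv(binned, target):
    # Cross-tabulate goods (target == 1) and bads (target == 0) per bin.
    df = pd.DataFrame({'bin': binned, 'target': target})
    grp = df.groupby('bin', observed=True)['target'].agg(n_obs='count', n_good='sum')
    grp['n_bad'] = grp['n_obs'] - grp['n_good']
    # Share of all goods / all bads that falls into each bin.
    grp['prop_n_good'] = grp['n_good'] / grp['n_good'].sum()
    grp['prop_n_bad'] = grp['n_bad'] / grp['n_bad'].sum()
    # Weight of evidence per bin, and the variable-level information value.
    grp['WoE'] = np.log(grp['prop_n_good'] / grp['prop_n_bad'])
    grp['IV'] = ((grp['prop_n_good'] - grp['prop_n_bad']) * grp['WoE']).sum()
    return grp

rng = np.random.default_rng(0)
x = rng.uniform(0, 100, 1000)
y = (rng.uniform(0, 1, 1000) < x / 100).astype(int)  # higher x -> more goods
table = woe_iv(pd.cut(x, 5), y)
```

Positive WoE marks bins where goods are over-represented relative to bads; IV sums each bin's contribution into a single predictive-power score for the variable.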
In [1223]:
plot_by_woe(df_temp, 90)
# We plot the weight of evidence values.
(Figure: weight of evidence by 'loan_amnt' bin)
In [1224]:
# We create the following categories:
# < 2500, 2500 - 6500, 6500 - 9500, 9500 - 10800, 10800 - 17500, 17500 - 28500, >= 28500.
df_inputs_prepr['loan_amnt:<2500'] = np.where((df_inputs_prepr['loan_amnt'] <= 2500.), 1, 0)
df_inputs_prepr['loan_amnt:2500-6500'] = np.where((df_inputs_prepr['loan_amnt'] > 2500.) & (df_inputs_prepr['loan_amnt'] <= 6500.), 1, 0)
df_inputs_prepr['loan_amnt:6500-9500'] = np.where((df_inputs_prepr['loan_amnt'] > 6500.) & (df_inputs_prepr['loan_amnt'] <= 9500.), 1, 0)
df_inputs_prepr['loan_amnt:9500-10800'] = np.where((df_inputs_prepr['loan_amnt'] > 9500.) & (df_inputs_prepr['loan_amnt'] <= 10800.), 1, 0)
df_inputs_prepr['loan_amnt:10800-17500'] = np.where((df_inputs_prepr['loan_amnt'] > 10800.) & (df_inputs_prepr['loan_amnt'] <= 17500.), 1, 0)
df_inputs_prepr['loan_amnt:17500-28500'] = np.where((df_inputs_prepr['loan_amnt'] > 17500.) & (df_inputs_prepr['loan_amnt'] <= 28500.), 1, 0)
df_inputs_prepr['loan_amnt:>=28500'] = np.where((df_inputs_prepr['loan_amnt'] > 28500.), 1, 0)
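The seven `np.where` lines above can also be generated in one step from the bin edges; a compact equivalent sketched on toy data (the edges mirror the categories above; the variable names here are illustrative):

```python
import numpy as np
import pandas as pd

amounts = pd.Series([1000, 5000, 9000, 10000, 15000, 20000, 30000])
bins = [-np.inf, 2500, 6500, 9500, 10800, 17500, 28500, np.inf]
labels = ['<2500', '2500-6500', '6500-9500', '9500-10800',
          '10800-17500', '17500-28500', '>=28500']
# pd.cut with right=True (the default) reproduces the <=/> boundaries above;
# get_dummies then emits one 0/1 column per coarse class.
dummies = pd.get_dummies(pd.cut(amounts, bins=bins, labels=labels),
                         prefix='loan_amnt', prefix_sep=':').astype(int)
```

Each row activates exactly one dummy, which is the property the chained `np.where` conditions are built to guarantee.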
In [1225]:
# Drop 'loan_amnt_factor' feature
df_inputs_prepr = df_inputs_prepr.drop(columns = ['loan_amnt_factor'])

Variable: 'int_rate'¶

In [1226]:
# unique values of 'int_rate'
df_inputs_prepr['int_rate'].unique()
Out[1226]:
array([13.99, 18.06, 12.29,  7.89, 21.6 , 16.29, 14.33, 14.64, 14.49,
       24.99, 18.49, 10.15,  5.32, 13.49, 19.03,  8.18,  9.67, 12.49,
       15.61, 21.97, 11.53, 16.99,  9.99,  6.92, 16.02,  6.03, 10.91,
       12.12, 13.35,  6.97,  7.9 ,  7.49,  7.62,  8.19, 11.49, 11.47,
       10.64,  8.9 , 19.99, 14.99, 10.99,  8.49, 13.53, 19.24,  9.16,
        8.67, 11.99, 17.86, 30.75,  9.44, 12.69, 13.11, 12.74, 16.01,
       14.46, 11.55, 14.16, 20.2 , 15.1 , 18.99, 15.59, 21.49, 13.59,
       11.67,  9.17, 26.49,  8.99,  7.97,  9.91, 13.67,  7.39, 19.52,
        8.24, 28.67, 19.97,  9.93,  8.39, 14.08,  6.49, 12.99, 13.18,
        6.62, 16.46, 17.57, 14.31, 15.49,  7.91, 25.69, 20.99,  9.76,
       18.25,  7.35, 13.65, 23.28,  6.72, 14.03, 17.09, 12.62, 12.79,
       17.27,  7.69,  9.75, 17.99, 10.38,  7.26, 22.47, 10.75,  7.84,
       16.55,  9.92, 10.49, 13.66, 14.65, 18.55, 13.98,  7.99,  9.8 ,
       15.31, 24.74, 30.79, 24.7 , 11.22, 21.48, 14.47, 11.86, 15.41,
       11.97, 15.88, 13.33, 16.24, 17.1 , 12.18, 12.35,  6.24,  8.38,
       20.31, 14.09, 23.99, 20.89, 18.24,  7.34,  7.02, 19.05, 21.99,
       19.19, 10.42, 11.14,  6.91, 15.77, 17.56,  6.08,  7.46, 16.2 ,
        6.68, 30.99, 21.  , 29.69,  9.58, 22.35, 11.39,  7.51, 22.15,
       19.48, 17.58, 20.39, 13.06, 15.05, 15.8 , 24.84,  6.89, 20.5 ,
        6.39, 11.71,  6.  , 12.05, 18.75, 24.49, 14.98, 11.98, 20.  ,
       13.44, 22.95, 30.84, 18.84, 10.  , 25.89, 19.53, 22.74, 24.5 ,
       12.42,  9.49, 23.7 , 11.06, 24.85, 12.73, 10.56, 15.95, 22.39,
       26.77, 14.26, 26.57, 22.99,  8.59, 10.16, 18.54, 11.44, 10.78,
       25.81, 17.76, 15.99,  5.93, 19.22,  6.71,  5.99, 19.42,  7.07,
       30.17, 12.59, 21.45, 19.72, 17.77, 21.67, 22.9 ,  5.31, 17.47,
        7.12, 18.2 , 16.14, 22.7 , 15.22, 13.85, 20.25, 10.41, 10.08,
       27.27, 12.88, 21.36, 21.85,  7.66, 16.7 ,  7.14, 28.69, 25.29,
       20.75, 20.49, 29.49, 15.02, 14.3 ,  7.37,  6.99, 12.39, 16.59,
       19.89,  6.07, 11.48, 22.4 , 14.84, 16.49, 13.68, 30.94, 15.27,
        7.24, 12.98,  8.94, 14.07, 19.47, 12.85,  6.54, 14.48, 23.43,
        9.71,  7.21, 10.74, 13.56, 13.72, 19.16, 19.2 , 12.61, 17.14,
       25.99,  7.96,  8.6 ,  7.74, 13.05, 26.24, 11.11,  9.43, 17.49,
       26.06, 21.18, 14.85, 20.8 , 21.7 , 28.99, 26.3 , 17.97, 19.29,
       23.1 , 22.45, 13.57, 18.85, 28.72, 27.31, 25.82,  7.88, 27.34,
       17.93, 21.98, 18.45, 11.26, 23.76, 23.26, 18.92, 15.04, 13.8 ,
       14.52, 14.17, 30.89, 16.45, 28.49, 16.77, 23.88, 23.63, 15.23,
       10.47, 24.37, 25.78, 25.09, 28.88, 22.2 , 12.13, 12.84,  9.63,
       25.57, 25.49, 23.5 ,  8.08, 14.27,  7.59, 22.78, 24.08, 13.61,
       21.28, 18.94, 21.15, 19.69, 25.83, 25.88, 29.99, 23.13, 16.69,
       30.65, 27.79,  6.19, 16.91,  8.32, 25.28, 15.96, 13.58, 12.87,
       19.92, 10.37, 11.36, 13.16, 18.64,  8.46, 23.4 , 15.21,  7.29,
       19.13, 15.81, 26.99, 24.89, 16.33, 30.49, 12.68, 16.78,  7.4 ,
        9.45,  8.88,  6.67, 22.91, 10.95, 16.  , 10.59, 25.8 , 12.53,
       13.23, 19.79, 23.32, 24.11,  9.25, 14.22, 14.91,  6.17, 25.34,
       13.24, 11.05,  7.56,  5.79, 27.49,  9.88, 13.92, 24.24, 24.83,
        5.42, 10.9 , 25.44, 11.89, 13.79, 15.68, 11.8 ,  9.33, 10.65,
       16.32, 14.35, 15.62, 20.9 , 14.72, 11.83, 23.87, 18.79, 26.14,
       28.18, 12.92, 15.2 , 19.91,  6.83, 20.62, 23.91, 26.31, 17.43,
       16.4 , 15.28, 10.33, 17.46, 10.62, 15.7 , 14.11, 10.07, 30.74,
       10.36, 16.95, 12.23,  6.11, 10.25,  9.07, 28.14, 11.12, 11.91,
       23.59,  9.32, 15.33,  9.2 , 11.34, 20.16, 12.09, 17.8 , 14.42,
       25.11, 24.33, 10.28, 27.88, 17.39,  9.62, 13.48, 15.57, 23.83,
       27.99, 18.39, 14.79, 13.22, 14.59, 14.83, 14.54, 13.47,  8.81,
       17.74, 28.34,  8.  , 16.89,  6.46,  9.7 , 14.96, 16.35, 11.28,
       14.74,  9.83, 18.3 , 13.43, 23.33,  6.76, 11.66, 10.72, 15.65,
        7.68, 12.41, 11.58, 17.88, 20.3 , 16.82, 15.58, 10.39, 17.51,
        7.43, 10.71, 19.82, 10.83, 11.63, 13.04, 19.74, 17.19, 22.11,
       11.31, 18.67, 11.46, 12.21, 16.07, 18.78, 22.06,  8.63, 11.54,
       19.41, 11.03, 19.36, 18.62,  9.64, 21.59, 24.2 , 29.67, 15.37,
       29.96, 20.48, 19.04, 23.52, 10.14, 11.59, 22.48, 13.3 ,  7.75,
       17.06,  8.7 , 25.65, 18.72, 20.11, 10.51, 14.61, 15.13, 18.91,
       11.09, 17.04, 20.53, 19.66, 12.72, 13.17, 21.74, 16.63, 10.46,
       24.52, 18.07,  7.05,  8.07, 12.54, 12.8 , 18.43, 14.75, 12.22,
       18.17, 21.14, 20.85, 14.18, 10.96, 20.03, 16.08, 11.78, 13.75,
       15.45,  9.38, 11.41, 12.36,  9.01, 20.77, 21.27, 14.93, 23.22,
       13.55, 24.76, 12.67, 17.26, 17.34, 16.11, 12.04, 12.17,  7.42,
       12.86, 14.82, 21.82, 13.62, 10.2 , 17.15,  9.51])
In [1227]:
# int_rate
df_inputs_prepr['int_rate_factor'] = pd.cut(df_inputs_prepr['int_rate'], 50)
# Here we do fine-classing: using the 'cut' method, we split the variable into 50 categories by its values.
df_temp = woe_ordered_continuous(df_inputs_prepr, 'int_rate_factor', df_targets_prepr)
# We calculate weight of evidence.
df_temp
C:\Users\pc\AppData\Local\Temp\ipykernel_6728\1025164655.py:4: FutureWarning: The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning.
  df = pd.concat([df.groupby(df.columns.values[0], as_index = False)[df.columns.values[1]].count(),
C:\Users\pc\AppData\Local\Temp\ipykernel_6728\1025164655.py:5: FutureWarning: The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning.
  df.groupby(df.columns.values[0], as_index = False)[df.columns.values[1]].mean()], axis = 1)
Out[1227]:
int_rate_factor n_obs prop_good prop_n_obs n_good n_bad prop_n_good prop_n_bad WoE diff_prop_good diff_WoE IV
0 (5.284, 5.824] 6219 0.039878 0.022678 248.0 5971.0 0.004218 0.027716 0.141645 NaN NaN 0.209145
1 (5.824, 6.337] 4802 0.036443 0.017511 175.0 4627.0 0.002976 0.021478 0.129770 0.003435 0.011876 0.209145
2 (6.337, 6.851] 4886 0.055055 0.017817 269.0 4617.0 0.004575 0.021431 0.193473 0.018612 0.063704 0.209145
3 (6.851, 7.364] 9903 0.070181 0.036111 695.0 9208.0 0.011819 0.042742 0.244143 0.015125 0.050670 0.209145
4 (7.364, 7.878] 6047 0.078551 0.022051 475.0 5572.0 0.008078 0.025864 0.271797 0.008371 0.027654 0.209145
5 (7.878, 8.392] 18415 0.100081 0.067151 1843.0 16572.0 0.031342 0.076925 0.341776 0.021530 0.069979 0.209145
6 (8.392, 8.905] 4307 0.088925 0.015706 383.0 3924.0 0.006513 0.018215 0.305713 0.011156 0.036063 0.209145
7 (8.905, 9.419] 7881 0.118513 0.028738 934.0 6947.0 0.015884 0.032247 0.400499 0.029588 0.094787 0.209145
8 (9.419, 9.932] 9372 0.143833 0.034175 1348.0 8024.0 0.022924 0.037246 0.479635 0.025320 0.079136 0.209145
9 (9.932, 10.446] 8276 0.137869 0.030179 1141.0 7135.0 0.019404 0.033119 0.461140 0.005964 0.018494 0.209145
10 (10.446, 10.96] 8089 0.173445 0.029497 1403.0 6686.0 0.023860 0.031035 0.570297 0.035577 0.109157 0.209145
11 (10.96, 11.473] 14962 0.154458 0.054559 2311.0 12651.0 0.039301 0.058724 0.512379 0.018987 0.057919 0.209145
12 (11.473, 11.987] 10049 0.171161 0.036644 1720.0 8329.0 0.029251 0.038662 0.563368 0.016703 0.050989 0.209145
13 (11.987, 12.5] 15824 0.173913 0.057703 2752.0 13072.0 0.046801 0.060678 0.571715 0.002752 0.008347 0.209145
14 (12.5, 13.014] 15417 0.214893 0.056218 3313.0 12104.0 0.056342 0.056185 0.694542 0.040980 0.122827 0.209145
15 (13.014, 13.528] 10572 0.209894 0.038551 2219.0 8353.0 0.037737 0.038773 0.679692 0.004999 0.014850 0.209145
16 (13.528, 14.041] 14814 0.239503 0.054020 3548.0 11266.0 0.060338 0.052295 0.767236 0.029609 0.087544 0.209145
17 (14.041, 14.555] 10999 0.240204 0.040108 2642.0 8357.0 0.044930 0.038792 0.769296 0.000700 0.002060 0.209145
18 (14.555, 15.068] 9532 0.267520 0.034759 2550.0 6982.0 0.043366 0.032409 0.849325 0.027316 0.080030 0.209145
19 (15.068, 15.582] 4554 0.228371 0.016606 1040.0 3514.0 0.017686 0.016311 0.734433 0.039149 0.114892 0.209145
20 (15.582, 16.096] 12273 0.280535 0.044754 3443.0 8830.0 0.058552 0.040987 0.887293 0.052164 0.152860 0.209145
21 (16.096, 16.609] 7214 0.285694 0.026306 2061.0 5153.0 0.035050 0.023919 0.902326 0.005160 0.015033 0.209145
22 (16.609, 17.123] 6386 0.313811 0.023287 2004.0 4382.0 0.034080 0.020341 0.984135 0.028117 0.081808 0.209145
23 (17.123, 17.636] 6197 0.306923 0.022597 1902.0 4295.0 0.032346 0.019937 0.964101 0.006889 0.020034 0.209145
24 (17.636, 18.15] 5692 0.345397 0.020756 1966.0 3726.0 0.033434 0.017295 1.076067 0.038474 0.111966 0.209145
25 (18.15, 18.664] 6027 0.338477 0.021978 2040.0 3987.0 0.034693 0.018507 1.055904 0.006920 0.020163 0.209145
26 (18.664, 19.177] 5708 0.345480 0.020814 1972.0 3736.0 0.033536 0.017342 1.076309 0.007003 0.020405 0.209145
27 (19.177, 19.691] 3499 0.368963 0.012759 1291.0 2208.0 0.021955 0.010249 1.144900 0.023483 0.068592 0.209145
28 (19.691, 20.204] 4606 0.382327 0.016796 1761.0 2845.0 0.029948 0.013206 1.184102 0.013365 0.039202 0.209145
29 (20.204, 20.718] 1330 0.328571 0.004850 437.0 893.0 0.007432 0.004145 1.027069 0.053756 0.157033 0.209145
30 (20.718, 21.232] 2649 0.392601 0.009660 1040.0 1609.0 0.017686 0.007469 1.214341 0.064030 0.187273 0.209145
31 (21.232, 21.745] 2271 0.393219 0.008281 893.0 1378.0 0.015187 0.006396 1.216163 0.000618 0.001822 0.209145
32 (21.745, 22.259] 1727 0.413434 0.006298 714.0 1013.0 0.012142 0.004702 1.276005 0.020215 0.059842 0.209145
33 (22.259, 22.772] 1798 0.397108 0.006556 714.0 1084.0 0.012142 0.005032 1.227640 0.016326 0.048365 0.209145
34 (22.772, 23.286] 1345 0.439405 0.004905 591.0 754.0 0.010051 0.003500 1.353685 0.042297 0.126045 0.209145
35 (23.286, 23.8] 894 0.359060 0.003260 321.0 573.0 0.005459 0.002660 1.115938 0.080345 0.237747 0.209145
36 (23.8, 24.313] 1836 0.456972 0.006695 839.0 997.0 0.014268 0.004628 1.406852 0.097911 0.290914 0.209145
37 (24.313, 24.827] 1043 0.410355 0.003803 428.0 615.0 0.007279 0.002855 1.266859 0.046617 0.139993 0.209145
38 (24.827, 25.34] 1562 0.484635 0.005696 757.0 805.0 0.012874 0.003737 1.491831 0.074280 0.224972 0.209145
39 (25.34, 25.854] 1367 0.455011 0.004985 622.0 745.0 0.010578 0.003458 1.400889 0.029624 0.090942 0.209145
40 (25.854, 26.368] 1081 0.474561 0.003942 513.0 568.0 0.008724 0.002637 1.460689 0.019550 0.059799 0.209145
41 (26.368, 26.881] 362 0.569061 0.001320 206.0 156.0 0.003503 0.000724 1.764378 0.094500 0.303690 0.209145
42 (26.881, 27.395] 300 0.493333 0.001094 148.0 152.0 0.002517 0.000706 1.518916 0.075727 0.245462 0.209145
43 (27.395, 27.908] 186 0.612903 0.000678 114.0 72.0 0.001939 0.000334 1.917045 0.119570 0.398129 0.209145
44 (27.908, 28.422] 114 0.535088 0.000416 61.0 53.0 0.001037 0.000246 1.651864 0.077816 0.265181 0.209145
45 (28.422, 28.936] 452 0.522124 0.001648 236.0 216.0 0.004013 0.001003 1.610021 0.012964 0.041843 0.209145
46 (28.936, 29.449] 52 0.596154 0.000190 31.0 21.0 0.000527 0.000097 1.857594 0.074030 0.247573 0.209145
47 (29.449, 29.963] 248 0.552419 0.000904 137.0 111.0 0.002330 0.000515 1.708712 0.043734 0.148881 0.209145
48 (29.963, 30.476] 228 0.429825 0.000831 98.0 130.0 0.001667 0.000603 1.324912 0.122595 0.383800 0.209145
49 (30.476, 30.99] 867 0.522491 0.003162 453.0 414.0 0.007704 0.001922 1.611199 0.092667 0.286287 0.209145
In [1228]:
plot_by_woe(df_temp, 90)
# We plot the weight of evidence values.
(Figure: weight of evidence by 'int_rate' bin)
In [1229]:
# We create the following categories:
# <= 8, 8 - 12.5, 12.5 - 16.5, 16.5 - 20, 20 - 23.5, > 23.5
df_inputs_prepr['int_rate:<=8'] = np.where((df_inputs_prepr['int_rate'] <= 8.0), 1, 0)
df_inputs_prepr['int_rate:8-12.5'] = np.where((df_inputs_prepr['int_rate'] > 8.0) & (df_inputs_prepr['int_rate'] <= 12.5), 1, 0)
df_inputs_prepr['int_rate:12.5-16.5'] = np.where((df_inputs_prepr['int_rate'] > 12.5) & (df_inputs_prepr['int_rate'] <= 16.5), 1, 0)
df_inputs_prepr['int_rate:16.5-20'] = np.where((df_inputs_prepr['int_rate'] > 16.5) & (df_inputs_prepr['int_rate'] <= 20.0), 1, 0)
df_inputs_prepr['int_rate:20-23.5'] = np.where((df_inputs_prepr['int_rate'] > 20.0) & (df_inputs_prepr['int_rate'] <= 23.5), 1, 0)
df_inputs_prepr['int_rate:>23.5'] = np.where((df_inputs_prepr['int_rate'] > 23.5), 1, 0)
In [1230]:
# Drop 'int_rate_factor' feature
df_inputs_prepr = df_inputs_prepr.drop(columns = ['int_rate_factor'])

Variable: 'annual_inc'¶

In [1231]:
# annual_inc
df_inputs_prepr['annual_inc_factor'] = pd.cut(df_inputs_prepr['annual_inc'], 75)
# Here we do fine-classing: using the 'cut' method, we split the variable into 75 categories by its values.
df_temp = woe_ordered_continuous(df_inputs_prepr, 'annual_inc_factor', df_targets_prepr)
# We calculate weight of evidence.
df_temp
Out[1231]:
annual_inc_factor n_obs prop_good prop_n_obs n_good n_bad prop_n_good prop_n_bad WoE diff_prop_good diff_WoE IV
0 (-9500.0, 126666.667] 247646 0.219491 0.903046 54356.0 193290.0 0.924390 0.897220 0.708175 NaN NaN inf
1 (126666.667, 253333.333] 23822 0.169801 0.086867 4045.0 19777.0 0.068790 0.091802 0.559236 0.049690 0.148939 inf
2 (253333.333, 380000.0] 1839 0.150625 0.006706 277.0 1562.0 0.004711 0.007251 0.500597 0.019176 0.058639 inf
3 (380000.0, 506666.667] 537 0.111732 0.001958 60.0 477.0 0.001020 0.002214 0.379012 0.038893 0.121585 inf
4 (506666.667, 633333.333] 174 0.149425 0.000634 26.0 148.0 0.000442 0.000687 0.496901 0.037693 0.117889 inf
5 (633333.333, 760000.0] 78 0.166667 0.000284 13.0 65.0 0.000221 0.000302 0.549702 0.017241 0.052801 inf
6 (760000.0, 886666.667] 38 0.210526 0.000139 8.0 30.0 0.000136 0.000139 0.681572 0.043860 0.131870 inf
7 (886666.667, 1013333.333] 39 0.179487 0.000142 7.0 32.0 0.000119 0.000149 0.588581 0.031039 0.092990 inf
8 (1013333.333, 1140000.0] 15 0.133333 0.000055 2.0 13.0 0.000034 0.000060 0.447019 0.046154 0.141563 inf
9 (1140000.0, 1266666.667] 9 0.111111 0.000033 1.0 8.0 0.000017 0.000037 0.377039 0.022222 0.069980 inf
10 (1266666.667, 1393333.333] 2 0.000000 0.000007 0.0 2.0 0.000000 0.000009 0.000000 0.111111 0.377039 inf
11 (1393333.333, 1520000.0] 7 0.285714 0.000026 2.0 5.0 0.000034 0.000023 0.902384 0.285714 0.902384 inf
12 (1520000.0, 1646666.667] 0 NaN 0.000000 NaN NaN NaN NaN NaN NaN NaN inf
13 (1646666.667, 1773333.333] 1 0.000000 0.000004 0.0 1.0 0.000000 0.000005 0.000000 NaN NaN inf
14 (1773333.333, 1900000.0] 3 0.000000 0.000011 0.0 3.0 0.000000 0.000014 0.000000 0.000000 0.000000 inf
15 (1900000.0, 2026666.667] 2 0.000000 0.000007 0.0 2.0 0.000000 0.000009 0.000000 0.000000 0.000000 inf
16 (2026666.667, 2153333.333] 0 NaN 0.000000 NaN NaN NaN NaN NaN NaN NaN inf
17 (2153333.333, 2280000.0] 0 NaN 0.000000 NaN NaN NaN NaN NaN NaN NaN inf
18 (2280000.0, 2406666.667] 2 0.000000 0.000007 0.0 2.0 0.000000 0.000009 0.000000 NaN NaN inf
19 (2406666.667, 2533333.333] 3 0.000000 0.000011 0.0 3.0 0.000000 0.000014 0.000000 0.000000 0.000000 inf
20 (2533333.333, 2660000.0] 1 1.000000 0.000004 1.0 0.0 0.000017 0.000000 inf 1.000000 inf inf
21 (2660000.0, 2786666.667] 0 NaN 0.000000 NaN NaN NaN NaN NaN NaN NaN inf
22 (2786666.667, 2913333.333] 0 NaN 0.000000 NaN NaN NaN NaN NaN NaN NaN inf
23 (2913333.333, 3040000.0] 3 0.666667 0.000011 2.0 1.0 0.000034 0.000005 2.119548 NaN NaN inf
24 (3040000.0, 3166666.667] 2 0.500000 0.000007 1.0 1.0 0.000017 0.000005 1.539806 0.166667 0.579742 inf
25 (3166666.667, 3293333.333] 0 NaN 0.000000 NaN NaN NaN NaN NaN NaN NaN inf
26 (3293333.333, 3420000.0] 0 NaN 0.000000 NaN NaN NaN NaN NaN NaN NaN inf
27 (3420000.0, 3546666.667] 0 NaN 0.000000 NaN NaN NaN NaN NaN NaN NaN inf
28 (3546666.667, 3673333.333] 0 NaN 0.000000 NaN NaN NaN NaN NaN NaN NaN inf
29 (3673333.333, 3800000.0] 0 NaN 0.000000 NaN NaN NaN NaN NaN NaN NaN inf
30 (3800000.0, 3926666.667] 1 0.000000 0.000004 0.0 1.0 0.000000 0.000005 0.000000 NaN NaN inf
31 (3926666.667, 4053333.333] 0 NaN 0.000000 NaN NaN NaN NaN NaN NaN NaN inf
32 (4053333.333, 4180000.0] 0 NaN 0.000000 NaN NaN NaN NaN NaN NaN NaN inf
33 (4180000.0, 4306666.667] 0 NaN 0.000000 NaN NaN NaN NaN NaN NaN NaN inf
34 (4306666.667, 4433333.333] 0 NaN 0.000000 NaN NaN NaN NaN NaN NaN NaN inf
35 (4433333.333, 4560000.0] 0 NaN 0.000000 NaN NaN NaN NaN NaN NaN NaN inf
36 (4560000.0, 4686666.667] 1 0.000000 0.000004 0.0 1.0 0.000000 0.000005 0.000000 NaN NaN inf
37 (4686666.667, 4813333.333] 1 0.000000 0.000004 0.0 1.0 0.000000 0.000005 0.000000 0.000000 0.000000 inf
38 (4813333.333, 4940000.0] 0 NaN 0.000000 NaN NaN NaN NaN NaN NaN NaN inf
39 (4940000.0, 5066666.667] 0 NaN 0.000000 NaN NaN NaN NaN NaN NaN NaN inf
40 (5066666.667, 5193333.333] 0 NaN 0.000000 NaN NaN NaN NaN NaN NaN NaN inf
41 (5193333.333, 5320000.0] 0 NaN 0.000000 NaN NaN NaN NaN NaN NaN NaN inf
42 (5320000.0, 5446666.667] 0 NaN 0.000000 NaN NaN NaN NaN NaN NaN NaN inf
43 (5446666.667, 5573333.333] 0 NaN 0.000000 NaN NaN NaN NaN NaN NaN NaN inf
44 (5573333.333, 5700000.0] 1 0.000000 0.000004 0.0 1.0 0.000000 0.000005 0.000000 NaN NaN inf
45 (5700000.0, 5826666.667] 0 NaN 0.000000 NaN NaN NaN NaN NaN NaN NaN inf
46 (5826666.667, 5953333.333] 0 NaN 0.000000 NaN NaN NaN NaN NaN NaN NaN inf
47 (5953333.333, 6080000.0] 1 0.000000 0.000004 0.0 1.0 0.000000 0.000005 0.000000 NaN NaN inf
48 (6080000.0, 6206666.667] 0 NaN 0.000000 NaN NaN NaN NaN NaN NaN NaN inf
49 (6206666.667, 6333333.333] 0 NaN 0.000000 NaN NaN NaN NaN NaN NaN NaN inf
50 (6333333.333, 6460000.0] 0 NaN 0.000000 NaN NaN NaN NaN NaN NaN NaN inf
51 (6460000.0, 6586666.667] 1 0.000000 0.000004 0.0 1.0 0.000000 0.000005 0.000000 NaN NaN inf
52 (6586666.667, 6713333.333] 0 NaN 0.000000 NaN NaN NaN NaN NaN NaN NaN inf
53 (6713333.333, 6840000.0] 0 NaN 0.000000 NaN NaN NaN NaN NaN NaN NaN inf
54 (6840000.0, 6966666.667] 0 NaN 0.000000 NaN NaN NaN NaN NaN NaN NaN inf
55 (6966666.667, 7093333.333] 2 0.000000 0.000007 0.0 2.0 0.000000 0.000009 0.000000 NaN NaN inf
56 (7093333.333, 7220000.0] 0 NaN 0.000000 NaN NaN NaN NaN NaN NaN NaN inf
57 (7220000.0, 7346666.667] 0 NaN 0.000000 NaN NaN NaN NaN NaN NaN NaN inf
58 (7346666.667, 7473333.333] 0 NaN 0.000000 NaN NaN NaN NaN NaN NaN NaN inf
59 (7473333.333, 7600000.0] 1 0.000000 0.000004 0.0 1.0 0.000000 0.000005 0.000000 NaN NaN inf
60 (7600000.0, 7726666.667] 0 NaN 0.000000 NaN NaN NaN NaN NaN NaN NaN inf
61 (7726666.667, 7853333.333] 0 NaN 0.000000 NaN NaN NaN NaN NaN NaN NaN inf
62 (7853333.333, 7980000.0] 0 NaN 0.000000 NaN NaN NaN NaN NaN NaN NaN inf
63 (7980000.0, 8106666.667] 0 NaN 0.000000 NaN NaN NaN NaN NaN NaN NaN inf
64 (8106666.667, 8233333.333] 0 NaN 0.000000 NaN NaN NaN NaN NaN NaN NaN inf
65 (8233333.333, 8360000.0] 0 NaN 0.000000 NaN NaN NaN NaN NaN NaN NaN inf
66 (8360000.0, 8486666.667] 0 NaN 0.000000 NaN NaN NaN NaN NaN NaN NaN inf
67 (8486666.667, 8613333.333] 0 NaN 0.000000 NaN NaN NaN NaN NaN NaN NaN inf
68 (8613333.333, 8740000.0] 0 NaN 0.000000 NaN NaN NaN NaN NaN NaN NaN inf
69 (8740000.0, 8866666.667] 0 NaN 0.000000 NaN NaN NaN NaN NaN NaN NaN inf
70 (8866666.667, 8993333.333] 0 NaN 0.000000 NaN NaN NaN NaN NaN NaN NaN inf
71 (8993333.333, 9120000.0] 0 NaN 0.000000 NaN NaN NaN NaN NaN NaN NaN inf
72 (9120000.0, 9246666.667] 0 NaN 0.000000 NaN NaN NaN NaN NaN NaN NaN inf
73 (9246666.667, 9373333.333] 1 0.000000 0.000004 0.0 1.0 0.000000 0.000005 0.000000 NaN NaN inf
74 (9373333.333, 9500000.0] 1 1.000000 0.000004 1.0 0.0 0.000017 0.000000 inf 1.000000 inf inf
In [1232]:
# Initial examination shows that very few borrowers have large incomes, so the upper bins are extremely sparse.
# Hence, we use a single category for incomes above 140K and apply our fine-classing approach
# only to borrowers earning 140K or less.
df_inputs_prepr_temp = df_inputs_prepr.loc[df_inputs_prepr['annual_inc'] <= 140000., :].copy()
# .copy() gives an independent DataFrame, so the assignment below no longer triggers a SettingWithCopyWarning.
df_inputs_prepr_temp['annual_inc_factor'] = pd.cut(df_inputs_prepr_temp['annual_inc'], 50)
# Here we do fine-classing: using the 'cut' method, we split the variable into 50 categories by its values.
df_temp = woe_ordered_continuous(df_inputs_prepr_temp, 'annual_inc_factor', df_targets_prepr[df_inputs_prepr_temp.index])
# We calculate weight of evidence.
df_temp
Out[1232]:
annual_inc_factor n_obs prop_good prop_n_obs n_good n_bad prop_n_good prop_n_bad WoE diff_prop_good diff_WoE IV
0 (-140.0, 2800.0] 83 0.240964 0.000325 20.0 63.0 0.000360 0.000316 0.760526 NaN NaN 0.010553
1 (2800.0, 5600.0] 31 0.354839 0.000122 11.0 20.0 0.000198 0.000100 1.089912 0.113875 0.329386 0.010553
2 (5600.0, 8400.0] 78 0.269231 0.000306 21.0 57.0 0.000378 0.000286 0.842560 0.085608 0.247352 0.010553
3 (8400.0, 11200.0] 350 0.291429 0.001372 102.0 248.0 0.001835 0.001243 0.906712 0.022198 0.064152 0.010553
4 (11200.0, 14000.0] 647 0.281298 0.002537 182.0 465.0 0.003275 0.002331 0.877455 0.010130 0.029257 0.010553
5 (14000.0, 16800.0] 1014 0.279093 0.003976 283.0 731.0 0.005092 0.003665 0.871081 0.002206 0.006374 0.010553
6 (16800.0, 19600.0] 1285 0.257588 0.005038 331.0 954.0 0.005956 0.004783 0.808830 0.021505 0.062251 0.010553
7 (19600.0, 22400.0] 2672 0.264222 0.010477 706.0 1966.0 0.012704 0.009856 0.828057 0.006634 0.019227 0.010553
8 (22400.0, 25200.0] 4492 0.255120 0.017613 1146.0 3346.0 0.020621 0.016775 0.801672 0.009101 0.026385 0.010553
9 (25200.0, 28000.0] 3879 0.260376 0.015209 1010.0 2869.0 0.018174 0.014383 0.816916 0.005256 0.015243 0.010553
10 (28000.0, 30800.0] 6069 0.245840 0.023796 1492.0 4577.0 0.026847 0.022946 0.774714 0.014537 0.042202 0.010553
11 (30800.0, 33600.0] 5768 0.248440 0.022616 1433.0 4335.0 0.025785 0.021733 0.782273 0.002600 0.007559 0.010553
12 (33600.0, 36400.0] 9387 0.251838 0.036806 2364.0 7023.0 0.042537 0.035209 0.792144 0.003398 0.009871 0.010553
13 (36400.0, 39200.0] 6374 0.238783 0.024992 1522.0 4852.0 0.027386 0.024325 0.754172 0.013055 0.037972 0.010553
14 (39200.0, 42000.0] 13875 0.234739 0.054403 3257.0 10618.0 0.058605 0.053232 0.742383 0.004044 0.011789 0.010553
15 (42000.0, 44800.0] 5049 0.246583 0.019797 1245.0 3804.0 0.022402 0.019071 0.776877 0.011845 0.034494 0.010553
16 (44800.0, 47600.0] 11615 0.232114 0.045542 2696.0 8919.0 0.048511 0.044715 0.734722 0.014470 0.042155 0.010553
17 (47600.0, 50400.0] 15763 0.232443 0.061806 3664.0 12099.0 0.065929 0.060657 0.735684 0.000329 0.000962 0.010553
18 (50400.0, 53200.0] 8349 0.219907 0.032736 1836.0 6513.0 0.033036 0.032652 0.699011 0.012536 0.036673 0.010553
19 (53200.0, 56000.0] 11837 0.225395 0.046412 2668.0 9169.0 0.048007 0.045968 0.715086 0.005488 0.016074 0.010553
20 (56000.0, 58800.0] 5150 0.220194 0.020193 1134.0 4016.0 0.020405 0.020134 0.699855 0.005201 0.015231 0.010553
21 (58800.0, 61600.0] 13673 0.224823 0.053611 3074.0 10599.0 0.055313 0.053137 0.713411 0.004628 0.013556 0.010553
22 (61600.0, 64400.0] 6577 0.216360 0.025788 1423.0 5154.0 0.025605 0.025839 0.688607 0.008463 0.024804 0.010553
23 (64400.0, 67200.0] 11622 0.223714 0.045569 2600.0 9022.0 0.046784 0.045231 0.710165 0.007354 0.021558 0.010553
24 (67200.0, 70000.0] 11659 0.217343 0.045714 2534.0 9125.0 0.045596 0.045747 0.691492 0.006371 0.018673 0.010553
25 (70000.0, 72800.0] 4992 0.199319 0.019573 995.0 3997.0 0.017904 0.020039 0.638407 0.018024 0.053085 0.010553
26 (72800.0, 75600.0] 10040 0.213546 0.039366 2144.0 7896.0 0.038578 0.039586 0.680341 0.014227 0.041934 0.010553
27 (75600.0, 78400.0] 4336 0.200876 0.017001 871.0 3465.0 0.015673 0.017371 0.643010 0.012669 0.037331 0.010553
28 (78400.0, 81200.0] 9345 0.209524 0.036641 1958.0 7387.0 0.035232 0.037034 0.668512 0.008647 0.025502 0.010553
29 (81200.0, 84000.0] 4429 0.191465 0.017366 848.0 3581.0 0.015259 0.017953 0.615143 0.018058 0.053369 0.010553
30 (84000.0, 86800.0] 6724 0.206425 0.026364 1388.0 5336.0 0.024975 0.026752 0.659384 0.014959 0.044240 0.010553
31 (86800.0, 89600.0] 3212 0.190224 0.012594 611.0 2601.0 0.010994 0.013040 0.611458 0.016201 0.047925 0.010553
32 (89600.0, 92400.0] 7977 0.193180 0.031277 1541.0 6436.0 0.027728 0.032266 0.620231 0.002956 0.008773 0.010553
33 (92400.0, 95200.0] 4984 0.186798 0.019542 931.0 4053.0 0.016752 0.020319 0.601274 0.006383 0.018957 0.010553
34 (95200.0, 98000.0] 3475 0.195108 0.013625 678.0 2797.0 0.012200 0.014023 0.625944 0.008310 0.024670 0.010553
35 (98000.0, 100800.0] 6710 0.188972 0.026310 1268.0 5442.0 0.022816 0.027283 0.607738 0.006136 0.018206 0.010553
36 (100800.0, 103600.0] 2338 0.165526 0.009167 387.0 1951.0 0.006964 0.009781 0.537625 0.023446 0.070113 0.010553
37 (103600.0, 106400.0] 3469 0.170366 0.013602 591.0 2878.0 0.010634 0.014429 0.552176 0.004840 0.014551 0.010553
38 (106400.0, 109200.0] 1740 0.172989 0.006822 301.0 1439.0 0.005416 0.007214 0.560042 0.002622 0.007866 0.010553
39 (109200.0, 112000.0] 4536 0.189153 0.017785 858.0 3678.0 0.015439 0.018439 0.608278 0.016165 0.048236 0.010553
40 (112000.0, 114800.0] 878 0.149203 0.003443 131.0 747.0 0.002357 0.003745 0.488222 0.039951 0.120056 0.010553
41 (114800.0, 117600.0] 2454 0.174002 0.009622 427.0 2027.0 0.007683 0.010162 0.563078 0.024799 0.074856 0.010553
42 (117600.0, 120400.0] 4985 0.201204 0.019546 1003.0 3982.0 0.018048 0.019963 0.643977 0.027202 0.080899 0.010553
43 (120400.0, 123200.0] 805 0.159006 0.003156 128.0 677.0 0.002303 0.003394 0.517955 0.042197 0.126022 0.010553
44 (123200.0, 126000.0] 2880 0.187847 0.011292 541.0 2339.0 0.009735 0.011726 0.604396 0.028841 0.086440 0.010553
45 (126000.0, 128800.0] 584 0.136986 0.002290 80.0 504.0 0.001439 0.002527 0.450885 0.050861 0.153511 0.010553
46 (128800.0, 131600.0] 2504 0.171725 0.009818 430.0 2074.0 0.007737 0.010398 0.556254 0.034739 0.105369 0.010553
47 (131600.0, 134400.0] 641 0.159126 0.002513 102.0 539.0 0.001835 0.002702 0.518318 0.012599 0.037936 0.010553
48 (134400.0, 137200.0] 1583 0.162982 0.006207 258.0 1325.0 0.004642 0.006643 0.529958 0.003855 0.011640 0.010553
49 (137200.0, 140000.0] 2121 0.165488 0.008316 351.0 1770.0 0.006316 0.008874 0.537510 0.002506 0.007552 0.010553
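A detail worth noting from cell In [1232]: because `.loc` filtering preserves row labels, the targets can be aligned to the filtered inputs purely via the index, as in `df_targets_prepr[df_inputs_prepr_temp.index]`. A minimal sketch of that pattern on toy data (the values and variable names here are hypothetical):

```python
import pandas as pd

inputs = pd.DataFrame({'annual_inc': [50000, 200000, 80000, 150000]})
targets = pd.Series([1, 0, 1, 0])

# Filtering keeps the original row labels, so the surviving index
# selects exactly the matching targets, with no risk of misalignment.
inputs_temp = inputs.loc[inputs['annual_inc'] <= 140000, :].copy()
targets_temp = targets[inputs_temp.index]
```

This is why the temporary frame is not reset with `reset_index(drop=True)` before computing WoE: the shared index is doing the alignment work.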
In [1233]:
plot_by_woe(df_temp, 90)
# We plot the weight of evidence values.
(Figure: weight of evidence by 'annual_inc' bin, incomes of 140K or less)
In [1234]:
# WoE broadly decreases with income, so we split income into 12 categories.
df_inputs_prepr['annual_inc:<20K'] = np.where((df_inputs_prepr['annual_inc'] <= 20000), 1, 0)
df_inputs_prepr['annual_inc:20K-30K'] = np.where((df_inputs_prepr['annual_inc'] > 20000) & (df_inputs_prepr['annual_inc'] <= 30000), 1, 0)
df_inputs_prepr['annual_inc:30K-40K'] = np.where((df_inputs_prepr['annual_inc'] > 30000) & (df_inputs_prepr['annual_inc'] <= 40000), 1, 0)
df_inputs_prepr['annual_inc:40K-50K'] = np.where((df_inputs_prepr['annual_inc'] > 40000) & (df_inputs_prepr['annual_inc'] <= 50000), 1, 0)
df_inputs_prepr['annual_inc:50K-60K'] = np.where((df_inputs_prepr['annual_inc'] > 50000) & (df_inputs_prepr['annual_inc'] <= 60000), 1, 0)
df_inputs_prepr['annual_inc:60K-70K'] = np.where((df_inputs_prepr['annual_inc'] > 60000) & (df_inputs_prepr['annual_inc'] <= 70000), 1, 0)
df_inputs_prepr['annual_inc:70K-80K'] = np.where((df_inputs_prepr['annual_inc'] > 70000) & (df_inputs_prepr['annual_inc'] <= 80000), 1, 0)
df_inputs_prepr['annual_inc:80K-90K'] = np.where((df_inputs_prepr['annual_inc'] > 80000) & (df_inputs_prepr['annual_inc'] <= 90000), 1, 0)
df_inputs_prepr['annual_inc:90K-100K'] = np.where((df_inputs_prepr['annual_inc'] > 90000) & (df_inputs_prepr['annual_inc'] <= 100000), 1, 0)
df_inputs_prepr['annual_inc:100K-120K'] = np.where((df_inputs_prepr['annual_inc'] > 100000) & (df_inputs_prepr['annual_inc'] <= 120000), 1, 0)
df_inputs_prepr['annual_inc:120K-140K'] = np.where((df_inputs_prepr['annual_inc'] > 120000) & (df_inputs_prepr['annual_inc'] <= 140000), 1, 0)
df_inputs_prepr['annual_inc:>140K'] = np.where((df_inputs_prepr['annual_inc'] > 140000), 1, 0)
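The twelve income bands above are contiguous half-open intervals, so each borrower should activate exactly one dummy. A quick sanity check of that property, condensed to three hypothetical bands:

```python
import numpy as np
import pandas as pd

# Hypothetical incomes; three condensed bands standing in for the twelve above.
df = pd.DataFrame({'annual_inc': [15000, 45000, 130000, 250000]})
df['annual_inc:<20K'] = np.where(df['annual_inc'] <= 20000, 1, 0)
df['annual_inc:20K-140K'] = np.where((df['annual_inc'] > 20000) & (df['annual_inc'] <= 140000), 1, 0)
df['annual_inc:>140K'] = np.where(df['annual_inc'] > 140000, 1, 0)

# Every row must fall into exactly one band.
band_cols = [c for c in df.columns if c.startswith('annual_inc:')]
exclusive = df[band_cols].sum(axis=1).eq(1).all()
```

Running such a check after each coarse-classing block catches off-by-one boundary mistakes (e.g. two bands both using `<=` on the same edge) before they silently distort the model inputs.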
In [1235]:
df_inputs_prepr = df_inputs_prepr.drop(columns = ['annual_inc_factor'])
# Drop the temporary 'annual_inc_factor' feature

Variable: 'emp_length_int'¶

In [1236]:
df_inputs_prepr['emp_length_int'].unique()
Out[1236]:
array([ 7.,  2.,  0., 10.,  8.,  1.,  3.,  5.,  9.,  4.,  6.])
In [1237]:
# emp_length_int
df_temp = woe_ordered_continuous(df_inputs_prepr, 'emp_length_int', df_targets_prepr)
# We calculate weight of evidence.
df_temp
Out[1237]:
emp_length_int n_obs prop_good prop_n_obs n_good n_bad prop_n_good prop_n_bad WoE diff_prop_good diff_WoE IV
0 0.0 22011 0.221707 0.080264 4880.0 17131.0 0.082990 0.079519 0.714738 NaN NaN 0.000484
1 1.0 18068 0.223987 0.065885 4047.0 14021.0 0.068824 0.065083 0.721482 0.002280 0.006744 0.000484
2 2.0 24789 0.210698 0.090394 5223.0 19566.0 0.088824 0.090822 0.682083 0.013289 0.039399 0.000484
3 3.0 22148 0.217446 0.080763 4816.0 17332.0 0.081902 0.080452 0.702116 0.006748 0.020033 0.000484
4 4.0 16597 0.216244 0.060521 3589.0 13008.0 0.061035 0.060381 0.698551 0.001202 0.003565 0.000484
5 5.0 17151 0.209550 0.062541 3594.0 13557.0 0.061120 0.062929 0.678670 0.006693 0.019881 0.000484
6 6.0 12651 0.204332 0.046132 2585.0 10066.0 0.043961 0.046725 0.663128 0.005219 0.015542 0.000484
7 7.0 12214 0.202964 0.044539 2479.0 9735.0 0.042158 0.045188 0.659048 0.001368 0.004080 0.000484
8 8.0 12253 0.208276 0.044681 2552.0 9701.0 0.043400 0.045030 0.674876 0.005312 0.015828 0.000484
9 9.0 10294 0.220128 0.037537 2266.0 8028.0 0.038536 0.037265 0.710063 0.011853 0.035187 0.000484
10 10.0 106058 0.214703 0.386743 22771.0 83287.0 0.387249 0.386605 0.693980 0.005425 0.016083 0.000484
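For reference, a textbook formulation of the per-bin WoE and IV computation is sketched below. This is a minimal illustration only: the notebook's own `woe_ordered_continuous` helper is defined earlier, and its exact formula and output columns may differ from this sketch. The target is coded 1 = good.

```python
import numpy as np
import pandas as pd

def woe_discrete_sketch(df, feature, target):
    """Textbook WoE/IV for a discrete feature and a binary target (1 = good).

    WoE = ln(prop_n_good / prop_n_bad) per bin;
    IV  = sum over bins of (prop_n_good - prop_n_bad) * WoE.
    """
    tmp = pd.DataFrame({feature: df[feature].values, 'good': np.asarray(target)})
    grouped = tmp.groupby(feature, observed=True)['good'].agg(['count', 'sum'])
    grouped.columns = ['n_obs', 'n_good']
    grouped['n_bad'] = grouped['n_obs'] - grouped['n_good']
    grouped['prop_n_good'] = grouped['n_good'] / grouped['n_good'].sum()
    grouped['prop_n_bad'] = grouped['n_bad'] / grouped['n_bad'].sum()
    grouped['WoE'] = np.log(grouped['prop_n_good'] / grouped['prop_n_bad'])
    grouped['IV'] = ((grouped['prop_n_good'] - grouped['prop_n_bad'])
                     * grouped['WoE']).sum()
    return grouped.reset_index()

# Toy invocation: two classes with identical good/bad mix, so WoE and IV are 0.
example = woe_discrete_sketch(pd.DataFrame({'x': ['a', 'a', 'b', 'b']}),
                              'x', pd.Series([1, 0, 0, 1]))
```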
In [1238]:
plot_by_woe(df_temp)
# We plot the weight of evidence values.
[WoE plot by 'emp_length_int' category]
In [1239]:
# We create the following categories:
# 0 , 1, 2 - 4, 5 - 7, 8 - 9, 10
df_inputs_prepr['emp_length_int:0'] = np.where(df_inputs_prepr['emp_length_int'].isin([0]), 1, 0)
df_inputs_prepr['emp_length_int:1'] = np.where(df_inputs_prepr['emp_length_int'].isin([1]), 1, 0)
df_inputs_prepr['emp_length_int:2-4'] = np.where(df_inputs_prepr['emp_length_int'].isin(range(2, 5)), 1, 0)
df_inputs_prepr['emp_length_int:5-7'] = np.where(df_inputs_prepr['emp_length_int'].isin(range(5, 8)), 1, 0)
df_inputs_prepr['emp_length_int:8-9'] = np.where(df_inputs_prepr['emp_length_int'].isin(range(8, 10)), 1, 0)
df_inputs_prepr['emp_length_int:10'] = np.where(df_inputs_prepr['emp_length_int'].isin([10]), 1, 0)
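The block of `np.where` indicators above can also be generated from explicit bin edges. Here is a sketch using a hypothetical `dummies_from_bins` helper (not part of the notebook's pipeline) built on `pd.cut` and `pd.get_dummies`; the edges below reproduce the six classes chosen above:

```python
import numpy as np
import pandas as pd

def dummies_from_bins(series, bins, labels, prefix):
    """Build one 0/1 indicator column per coarse class, named '<prefix>:<label>'.

    `bins` are right-inclusive edges as in pd.cut. Hypothetical helper,
    shown for illustration only.
    """
    cats = pd.cut(series, bins=bins, labels=labels)
    return pd.get_dummies(cats, prefix=prefix, prefix_sep=':').astype(int)

# The six employment-length classes, on a tiny sample:
sample = pd.Series([0., 1., 3., 6., 9., 10.])
emp_dummies = dummies_from_bins(sample,
                                bins=[-1, 0, 1, 4, 7, 9, 10],
                                labels=['0', '1', '2-4', '5-7', '8-9', '10'],
                                prefix='emp_length_int')
```

One advantage of this form is that the edges and labels live in one place, so a class boundary can be changed without touching several `np.where` lines.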

Variable: 'dti'¶

In [1240]:
# dti
df_inputs_prepr['dti_factor'] = pd.cut(df_inputs_prepr['dti'], 50)
# Here we do fine-classing: using the 'cut' method, we split the variable into 50 categories by its values.

df_temp = woe_ordered_continuous(df_inputs_prepr, 'dti_factor', df_targets_prepr)
# We calculate weight of evidence.
df_temp
Out[1240]:
dti_factor n_obs prop_good prop_n_obs n_good n_bad prop_n_good prop_n_bad WoE diff_prop_good diff_WoE IV
0 (-0.999, 19.98] 164167 0.183502 0.598638 30125.0 134042.0 0.512313 0.622201 0.600696 NaN NaN inf
1 (19.98, 39.96] 108557 0.259136 0.395855 28131.0 80426.0 0.478402 0.373324 0.824818 0.075634 0.224122 inf
2 (39.96, 59.94] 1108 0.349278 0.004040 387.0 721.0 0.006581 0.003347 1.087383 0.090142 0.262565 inf
3 (59.94, 79.92] 204 0.392157 0.000744 80.0 124.0 0.001360 0.000576 1.213032 0.042879 0.125649 inf
4 (79.92, 99.9] 76 0.421053 0.000277 32.0 44.0 0.000544 0.000204 1.298691 0.028896 0.085659 inf
5 (99.9, 119.88] 32 0.406250 0.000117 13.0 19.0 0.000221 0.000088 1.254684 0.014803 0.044007 inf
6 (119.88, 139.86] 26 0.538462 0.000095 14.0 12.0 0.000238 0.000056 1.662846 0.132212 0.408161 inf
7 (139.86, 159.84] 5 0.400000 0.000018 2.0 3.0 0.000034 0.000014 1.236185 0.138462 0.426660 inf
8 (159.84, 179.82] 6 0.166667 0.000022 1.0 5.0 0.000017 0.000023 0.549702 0.233333 0.686483 inf
9 (179.82, 199.8] 4 0.000000 0.000015 0.0 4.0 0.000000 0.000019 0.000000 0.166667 0.549702 inf
10 (199.8, 219.78] 5 0.200000 0.000018 1.0 4.0 0.000017 0.000019 0.650199 0.200000 0.650199 inf
11 (219.78, 239.76] 5 0.000000 0.000018 0.0 5.0 0.000000 0.000023 0.000000 0.200000 0.650199 inf
12 (239.76, 259.74] 6 0.666667 0.000022 4.0 2.0 0.000068 0.000009 2.119548 0.666667 2.119548 inf
13 (259.74, 279.72] 3 0.333333 0.000011 1.0 2.0 0.000017 0.000009 1.040928 0.333333 1.078620 inf
14 (279.72, 299.7] 3 0.333333 0.000011 1.0 2.0 0.000017 0.000009 1.040928 0.000000 0.000000 inf
15 (299.7, 319.68] 5 0.200000 0.000018 1.0 4.0 0.000017 0.000019 0.650199 0.133333 0.390729 inf
16 (319.68, 339.66] 2 0.500000 0.000007 1.0 1.0 0.000017 0.000005 1.539806 0.300000 0.889607 inf
17 (339.66, 359.64] 0 NaN 0.000000 NaN NaN NaN NaN NaN NaN NaN inf
18 (359.64, 379.62] 1 0.000000 0.000004 0.0 1.0 0.000000 0.000005 0.000000 NaN NaN inf
19 (379.62, 399.6] 1 1.000000 0.000004 1.0 0.0 0.000017 0.000000 inf 1.000000 inf inf
20 (399.6, 419.58] 1 0.000000 0.000004 0.0 1.0 0.000000 0.000005 0.000000 1.000000 inf inf
21 (419.58, 439.56] 0 NaN 0.000000 NaN NaN NaN NaN NaN NaN NaN inf
22 (439.56, 459.54] 1 0.000000 0.000004 0.0 1.0 0.000000 0.000005 0.000000 NaN NaN inf
23 (459.54, 479.52] 0 NaN 0.000000 NaN NaN NaN NaN NaN NaN NaN inf
24 (479.52, 499.5] 1 1.000000 0.000004 1.0 0.0 0.000017 0.000000 inf NaN NaN inf
25 (499.5, 519.48] 0 NaN 0.000000 NaN NaN NaN NaN NaN NaN NaN inf
26 (519.48, 539.46] 1 0.000000 0.000004 0.0 1.0 0.000000 0.000005 0.000000 NaN NaN inf
27 (539.46, 559.44] 0 NaN 0.000000 NaN NaN NaN NaN NaN NaN NaN inf
28 (559.44, 579.42] 1 1.000000 0.000004 1.0 0.0 0.000017 0.000000 inf NaN NaN inf
29 (579.42, 599.4] 0 NaN 0.000000 NaN NaN NaN NaN NaN NaN NaN inf
30 (599.4, 619.38] 0 NaN 0.000000 NaN NaN NaN NaN NaN NaN NaN inf
31 (619.38, 639.36] 0 NaN 0.000000 NaN NaN NaN NaN NaN NaN NaN inf
32 (639.36, 659.34] 0 NaN 0.000000 NaN NaN NaN NaN NaN NaN NaN inf
33 (659.34, 679.32] 2 0.500000 0.000007 1.0 1.0 0.000017 0.000005 1.539806 NaN NaN inf
34 (679.32, 699.3] 0 NaN 0.000000 NaN NaN NaN NaN NaN NaN NaN inf
35 (699.3, 719.28] 0 NaN 0.000000 NaN NaN NaN NaN NaN NaN NaN inf
36 (719.28, 739.26] 0 NaN 0.000000 NaN NaN NaN NaN NaN NaN NaN inf
37 (739.26, 759.24] 0 NaN 0.000000 NaN NaN NaN NaN NaN NaN NaN inf
38 (759.24, 779.22] 1 0.000000 0.000004 0.0 1.0 0.000000 0.000005 0.000000 NaN NaN inf
39 (779.22, 799.2] 0 NaN 0.000000 NaN NaN NaN NaN NaN NaN NaN inf
40 (799.2, 819.18] 0 NaN 0.000000 NaN NaN NaN NaN NaN NaN NaN inf
41 (819.18, 839.16] 0 NaN 0.000000 NaN NaN NaN NaN NaN NaN NaN inf
42 (839.16, 859.14] 0 NaN 0.000000 NaN NaN NaN NaN NaN NaN NaN inf
43 (859.14, 879.12] 0 NaN 0.000000 NaN NaN NaN NaN NaN NaN NaN inf
44 (879.12, 899.1] 1 1.000000 0.000004 1.0 0.0 0.000017 0.000000 inf NaN NaN inf
45 (899.1, 919.08] 0 NaN 0.000000 NaN NaN NaN NaN NaN NaN NaN inf
46 (919.08, 939.06] 0 NaN 0.000000 NaN NaN NaN NaN NaN NaN NaN inf
47 (939.06, 959.04] 0 NaN 0.000000 NaN NaN NaN NaN NaN NaN NaN inf
48 (959.04, 979.02] 0 NaN 0.000000 NaN NaN NaN NaN NaN NaN NaN inf
49 (979.02, 999.0] 9 0.333333 0.000033 3.0 6.0 0.000051 0.000028 1.040928 NaN NaN inf
In [1241]:
plot_by_woe(df_temp, 90)
# We plot the weight of evidence values.
[WoE plot for the 50 fine classes of 'dti']
In [1242]:
# A single category is created for borrowers with 'dti' greater than 40;
# the remaining borrowers ('dti' <= 40) are fine-classed below.
df_inputs_prepr_temp = df_inputs_prepr.loc[df_inputs_prepr['dti'] <= 40., : ].copy()
df_inputs_prepr_temp['dti_factor'] = pd.cut(df_inputs_prepr_temp['dti'], 2)
# Here we do fine-classing: using the 'cut' method, we split the variable into 2 categories by its values.
df_temp = woe_ordered_continuous(df_inputs_prepr_temp, 'dti_factor', df_targets_prepr[df_inputs_prepr_temp.index])
# We calculate weight of evidence.
df_temp
Out[1242]:
dti_factor n_obs prop_good prop_n_obs n_good n_bad prop_n_good prop_n_bad WoE diff_prop_good diff_WoE IV
0 (-0.04, 20.0] 164391 0.183471 0.602712 30161.0 134230.0 0.51767 0.625813 0.602782 NaN NaN 0.024369
1 (20.0, 40.0] 108361 0.259337 0.397288 28102.0 80259.0 0.48233 0.374187 0.828119 0.075866 0.225336 0.024369
In [1243]:
plot_by_woe(df_temp, 90)
# We plot the weight of evidence values.
[WoE plot for the two coarse classes of 'dti']
In [1244]:
# We create the following categories:  '<= 10.', '10. - 20.', '20. - 30.', '30. - 40.', (> 40.)
df_inputs_prepr['dti:<=10'] = np.where((df_inputs_prepr['dti'] <= 10.), 1, 0)
df_inputs_prepr['dti:10-20'] = np.where((df_inputs_prepr['dti'] > 10.) & (df_inputs_prepr['dti'] <= 20.), 1, 0)
df_inputs_prepr['dti:20-30'] = np.where((df_inputs_prepr['dti'] > 20.) & (df_inputs_prepr['dti'] <= 30.), 1, 0)
df_inputs_prepr['dti:30-40'] = np.where((df_inputs_prepr['dti'] > 30.) & (df_inputs_prepr['dti'] <= 40.), 1, 0)
df_inputs_prepr['dti:>40'] = np.where((df_inputs_prepr['dti'] > 40.), 1, 0)
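Since these coarse classes are meant to partition the variable, a quick sanity check (on illustrative values, not the Lending Club data) can confirm that the edges are exhaustive and mutually exclusive, so every observation receives exactly one flag:

```python
import numpy as np
import pandas as pd

# Illustrative 'dti' values, one per intended class:
dti = pd.Series([5., 15., 25., 35., 55.])
dummies = pd.DataFrame({
    'dti:<=10':  np.where(dti <= 10., 1, 0),
    'dti:10-20': np.where((dti > 10.) & (dti <= 20.), 1, 0),
    'dti:20-30': np.where((dti > 20.) & (dti <= 30.), 1, 0),
    'dti:30-40': np.where((dti > 30.) & (dti <= 40.), 1, 0),
    'dti:>40':   np.where(dti > 40., 1, 0),
})
# Every row falls in exactly one class:
assert (dummies.sum(axis=1) == 1).all()
```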
In [1245]:
df_inputs_prepr = df_inputs_prepr.drop(columns = ['dti_factor'])

Variable: 'min_mths_since_delinquency'¶

In [1246]:
# A separate category will be created for 'min_mths_since_delinquency' = 999, which encodes missing values.
# The remaining borrowers ('min_mths_since_delinquency' <= 500; the actual maximum is 226) are fine-classed below.
df_inputs_prepr_temp = df_inputs_prepr.loc[df_inputs_prepr['min_mths_since_delinquency'] <= 500, : ].copy()
df_inputs_prepr_temp['min_mths_since_delinquency_factor'] = pd.cut(df_inputs_prepr_temp['min_mths_since_delinquency'], 50)
# Here we do fine-classing: using the 'cut' method, we split the variable into 50 categories by its values.
df_temp = woe_ordered_continuous(df_inputs_prepr_temp, 'min_mths_since_delinquency_factor', df_targets_prepr[df_inputs_prepr_temp.index])
# We calculate weight of evidence.
df_temp
Out[1246]:
min_mths_since_delinquency_factor n_obs prop_good prop_n_obs n_good n_bad prop_n_good prop_n_bad WoE diff_prop_good diff_WoE IV
0 (-0.156, 3.12] 4315 0.221089 0.031104 954.0 3361.0 0.030663 0.031232 0.684007 NaN NaN 0.001227
1 (3.12, 6.24] 6932 0.242787 0.049969 1683.0 5249.0 0.054095 0.048776 0.746239 0.021698 0.062232 0.001227
2 (6.24, 9.36] 7868 0.235511 0.056716 1853.0 6015.0 0.059559 0.055894 0.725409 0.007276 0.020830 0.001227
3 (9.36, 12.48] 7648 0.231041 0.055130 1767.0 5881.0 0.056795 0.054649 0.712594 0.004470 0.012815 0.001227
4 (12.48, 15.6] 7595 0.228440 0.054748 1735.0 5860.0 0.055766 0.054453 0.705130 0.002601 0.007464 0.001227
5 (15.6, 18.72] 7320 0.223497 0.052766 1636.0 5684.0 0.052584 0.052818 0.690932 0.004942 0.014198 0.001227
6 (18.72, 21.84] 7057 0.223466 0.050870 1577.0 5480.0 0.050688 0.050922 0.690843 0.000031 0.000090 0.001227
7 (21.84, 24.96] 6578 0.223776 0.047417 1472.0 5106.0 0.047313 0.047447 0.691734 0.000310 0.000892 0.001227
8 (24.96, 28.08] 9075 0.225785 0.065416 2049.0 7026.0 0.065859 0.065288 0.697507 0.002009 0.005773 0.001227
9 (28.08, 31.2] 6289 0.223565 0.045334 1406.0 4883.0 0.045192 0.045375 0.691127 0.002220 0.006380 0.001227
10 (31.2, 34.32] 6001 0.221296 0.043258 1328.0 4673.0 0.042684 0.043423 0.684604 0.002269 0.006523 0.001227
11 (34.32, 37.44] 5808 0.217631 0.041866 1264.0 4544.0 0.040627 0.042225 0.674053 0.003666 0.010551 0.001227
12 (37.44, 40.56] 5682 0.228793 0.040958 1300.0 4382.0 0.041785 0.040719 0.706143 0.011162 0.032090 0.001227
13 (40.56, 43.68] 5444 0.216385 0.039243 1178.0 4266.0 0.037863 0.039641 0.670464 0.012408 0.035679 0.001227
14 (43.68, 46.8] 5261 0.224672 0.037923 1182.0 4079.0 0.037992 0.037904 0.694309 0.008287 0.023845 0.001227
15 (46.8, 49.92] 4929 0.218097 0.035530 1075.0 3854.0 0.034553 0.035813 0.675395 0.006575 0.018914 0.001227
16 (49.92, 53.04] 4738 0.225623 0.034153 1069.0 3669.0 0.034360 0.034094 0.697040 0.007526 0.021645 0.001227
17 (53.04, 56.16] 3677 0.220560 0.026505 811.0 2866.0 0.026067 0.026632 0.682486 0.005062 0.014555 0.001227
18 (56.16, 59.28] 3671 0.211114 0.026462 775.0 2896.0 0.024910 0.026911 0.655265 0.009446 0.027221 0.001227
19 (59.28, 62.4] 3379 0.215448 0.024357 728.0 2651.0 0.023399 0.024634 0.667765 0.004334 0.012500 0.001227
20 (62.4, 65.52] 3343 0.227640 0.024098 761.0 2582.0 0.024460 0.023993 0.702834 0.012191 0.035068 0.001227
21 (65.52, 68.64] 3395 0.213844 0.024473 726.0 2669.0 0.023335 0.024801 0.663140 0.013796 0.039694 0.001227
22 (68.64, 71.76] 3173 0.220611 0.022872 700.0 2473.0 0.022499 0.022980 0.682633 0.006768 0.019493 0.001227
23 (71.76, 74.88] 2952 0.226287 0.021279 668.0 2284.0 0.021471 0.021224 0.698949 0.005676 0.016317 0.001227
24 (74.88, 78.0] 3656 0.210613 0.026354 770.0 2886.0 0.024749 0.026818 0.653817 0.015675 0.045132 0.001227
25 (78.0, 81.12] 2322 0.212748 0.016738 494.0 1828.0 0.015878 0.016986 0.659978 0.002135 0.006161 0.001227
26 (81.12, 84.24] 393 0.259542 0.002833 102.0 291.0 0.003278 0.002704 0.794086 0.046794 0.134107 0.001227
27 (84.24, 87.36] 59 0.186441 0.000425 11.0 48.0 0.000354 0.000446 0.583710 0.073101 0.210376 0.001227
28 (87.36, 90.48] 29 0.241379 0.000209 7.0 22.0 0.000225 0.000204 0.742212 0.054939 0.158502 0.001227
29 (90.48, 93.6] 30 0.133333 0.000216 4.0 26.0 0.000129 0.000242 0.426670 0.108046 0.315542 0.001227
30 (93.6, 96.72] 22 0.181818 0.000159 4.0 18.0 0.000129 0.000167 0.570220 0.048485 0.143550 0.001227
31 (96.72, 99.84] 21 0.238095 0.000151 5.0 16.0 0.000161 0.000149 0.732812 0.056277 0.162591 0.001227
32 (99.84, 102.96] 11 0.454545 0.000079 5.0 6.0 0.000161 0.000056 1.356470 0.216450 0.623658 0.001227
33 (102.96, 106.08] 13 0.230769 0.000094 3.0 10.0 0.000096 0.000093 0.711815 0.223776 0.644655 0.001227
34 (106.08, 109.2] 10 0.300000 0.000072 3.0 7.0 0.000096 0.000065 0.909230 0.069231 0.197414 0.001227
35 (109.2, 112.32] 7 0.285714 0.000050 2.0 5.0 0.000064 0.000046 0.868604 0.014286 0.040625 0.001227
36 (112.32, 115.44] 7 0.285714 0.000050 2.0 5.0 0.000064 0.000046 0.868604 0.000000 0.000000 0.001227
37 (115.44, 118.56] 2 0.000000 0.000014 0.0 2.0 0.000000 0.000019 0.000000 0.285714 0.868604 0.001227
38 (118.56, 121.68] 2 0.000000 0.000014 0.0 2.0 0.000000 0.000019 0.000000 0.000000 0.000000 0.001227
39 (121.68, 124.8] 2 0.500000 0.000014 1.0 1.0 0.000032 0.000009 1.494914 0.500000 1.494914 0.001227
40 (124.8, 127.92] 3 0.666667 0.000022 2.0 1.0 0.000064 0.000009 2.069127 0.166667 0.574213 0.001227
41 (127.92, 131.04] 1 0.000000 0.000007 0.0 1.0 0.000000 0.000009 0.000000 0.666667 2.069127 0.001227
42 (131.04, 134.16] 1 0.000000 0.000007 0.0 1.0 0.000000 0.000009 0.000000 0.000000 0.000000 0.001227
43 (134.16, 137.28] 2 0.000000 0.000014 0.0 2.0 0.000000 0.000019 0.000000 0.000000 0.000000 0.001227
44 (137.28, 140.4] 0 NaN 0.000000 NaN NaN NaN NaN NaN NaN NaN 0.001227
45 (140.4, 143.52] 2 0.000000 0.000014 0.0 2.0 0.000000 0.000019 0.000000 NaN NaN 0.001227
46 (143.52, 146.64] 1 0.000000 0.000007 0.0 1.0 0.000000 0.000009 0.000000 0.000000 0.000000 0.001227
47 (146.64, 149.76] 0 NaN 0.000000 NaN NaN NaN NaN NaN NaN NaN 0.001227
48 (149.76, 152.88] 0 NaN 0.000000 NaN NaN NaN NaN NaN NaN NaN 0.001227
49 (152.88, 156.0] 1 0.000000 0.000007 0.0 1.0 0.000000 0.000009 0.000000 NaN NaN 0.001227
In [1247]:
plot_by_woe(df_temp, 90)
# We plot the weight of evidence values.
[WoE plot for the 50 fine classes of 'min_mths_since_delinquency']
In [1248]:
## We create the following categories:
# 'Missing', # <=20, # 20 - 40, # 40 - 80, # >80
df_inputs_prepr['min_mths_since_delinquency:Missing'] = np.where(df_inputs_prepr['min_mths_since_delinquency'].isin([999]), 1, 0)
df_inputs_prepr['min_mths_since_delinquency:<=20'] = np.where(df_inputs_prepr['min_mths_since_delinquency'].isin(range(21)), 1, 0)
df_inputs_prepr['min_mths_since_delinquency:20-40'] = np.where(df_inputs_prepr['min_mths_since_delinquency'].isin(range(21, 41)), 1, 0)
df_inputs_prepr['min_mths_since_delinquency:40-80'] = np.where(df_inputs_prepr['min_mths_since_delinquency'].isin(range(41, 81)), 1, 0)
df_inputs_prepr['min_mths_since_delinquency:>80'] = np.where((df_inputs_prepr['min_mths_since_delinquency'] > 80) & (df_inputs_prepr['min_mths_since_delinquency'] != 999), 1, 0)
# The sentinel value 999 (missing) is excluded from the '>80' class.
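Because 999 is a sentinel rather than a real month count, it must be kept out of the numeric classes. A small sketch on made-up values shows the exclusion at work:

```python
import numpy as np
import pandas as pd

SENTINEL = 999  # encodes a missing delinquency record in this dataset

# Illustrative values, one per intended class plus the sentinel:
s = pd.Series([3., 25., 77., 150., SENTINEL])
flags = pd.DataFrame({
    'min_mths_since_delinquency:Missing': np.where(s == SENTINEL, 1, 0),
    'min_mths_since_delinquency:<=20':    np.where(s <= 20, 1, 0),
    'min_mths_since_delinquency:20-40':   np.where((s > 20) & (s <= 40), 1, 0),
    'min_mths_since_delinquency:40-80':   np.where((s > 40) & (s <= 80), 1, 0),
    # the sentinel must be excluded here, or it would land in the '>80' class:
    'min_mths_since_delinquency:>80':     np.where((s > 80) & (s != SENTINEL), 1, 0),
})
# With the exclusion, every row gets exactly one flag:
assert (flags.sum(axis=1) == 1).all()
```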

Variable: 'mths_since_earliest_cr_line'¶

In [1249]:
# mths_since_earliest_cr_line
df_inputs_prepr['mths_since_earliest_cr_line_factor'] = pd.cut(df_inputs_prepr['mths_since_earliest_cr_line'], 45)
# Here we do fine-classing: using the 'cut' method, we split the variable into 45 categories by its values.
df_temp = woe_ordered_continuous(df_inputs_prepr, 'mths_since_earliest_cr_line_factor', df_targets_prepr)
# We calculate weight of evidence.
df_temp
Out[1249]:
mths_since_earliest_cr_line_factor n_obs prop_good prop_n_obs n_good n_bad prop_n_good prop_n_bad WoE diff_prop_good diff_WoE IV
0 (63.009, 86.022] 828 0.379227 0.003019 314.0 514.0 0.005340 0.002386 1.174995 NaN NaN 0.014181
1 (86.022, 108.044] 3257 0.312558 0.011877 1018.0 2239.0 0.017312 0.010393 0.980488 0.066669 0.194507 0.014181
2 (108.044, 130.067] 6589 0.274093 0.024027 1806.0 4783.0 0.030713 0.022202 0.868512 0.038464 0.111977 0.014181
3 (130.067, 152.089] 8358 0.249103 0.030478 2082.0 6276.0 0.035407 0.029132 0.795429 0.024991 0.073083 0.014181
4 (152.089, 174.111] 16774 0.247168 0.061167 4146.0 12628.0 0.070508 0.058617 0.789754 0.001934 0.005675 0.014181
5 (174.111, 196.133] 26766 0.247105 0.097603 6614.0 20152.0 0.112479 0.093542 0.789567 0.000064 0.000187 0.014181
6 (196.133, 218.156] 31526 0.231618 0.114960 7302.0 24224.0 0.124179 0.112444 0.744016 0.015486 0.045551 0.014181
7 (218.156, 240.178] 33108 0.211278 0.120729 6995.0 26113.0 0.118959 0.121212 0.683807 0.020340 0.060208 0.014181
8 (240.178, 262.2] 29997 0.203487 0.109385 6104.0 23893.0 0.103806 0.110907 0.660609 0.007791 0.023199 0.014181
9 (262.2, 284.222] 24356 0.199951 0.088815 4870.0 19486.0 0.082820 0.090451 0.650051 0.003536 0.010557 0.014181
10 (284.222, 306.244] 19873 0.201027 0.072467 3995.0 15878.0 0.067940 0.073703 0.653265 0.001076 0.003214 0.014181
11 (306.244, 328.267] 16852 0.188998 0.061451 3185.0 13667.0 0.054165 0.063440 0.617236 0.012028 0.036029 0.014181
12 (328.267, 350.289] 12111 0.189910 0.044163 2300.0 9811.0 0.039114 0.045541 0.619974 0.000912 0.002739 0.014181
13 (350.289, 372.311] 9398 0.185359 0.034270 1742.0 7656.0 0.029625 0.035538 0.606288 0.004551 0.013686 0.014181
14 (372.311, 394.333] 8182 0.180640 0.029836 1478.0 6704.0 0.025135 0.031119 0.592064 0.004718 0.014224 0.014181
15 (394.333, 416.356] 6746 0.189149 0.024599 1276.0 5470.0 0.021700 0.025391 0.617689 0.008509 0.025625 0.014181
16 (416.356, 438.378] 5158 0.184762 0.018809 953.0 4205.0 0.016207 0.019519 0.604490 0.004388 0.013198 0.014181
17 (438.378, 460.4] 4077 0.170223 0.014867 694.0 3383.0 0.011802 0.015703 0.560519 0.014538 0.043972 0.014181
18 (460.4, 482.422] 2610 0.167433 0.009517 437.0 2173.0 0.007432 0.010087 0.552035 0.002790 0.008484 0.014181
19 (482.422, 504.444] 1839 0.177814 0.006706 327.0 1512.0 0.005561 0.007018 0.583525 0.010381 0.031490 0.014181
20 (504.444, 526.467] 1718 0.185099 0.006265 318.0 1400.0 0.005408 0.006499 0.605506 0.007285 0.021982 0.014181
21 (526.467, 548.489] 1273 0.186174 0.004642 237.0 1036.0 0.004030 0.004809 0.608744 0.001075 0.003237 0.014181
22 (548.489, 570.511] 813 0.206642 0.002965 168.0 645.0 0.002857 0.002994 0.670013 0.020468 0.061269 0.014181
23 (570.511, 592.533] 623 0.191011 0.002272 119.0 504.0 0.002024 0.002339 0.623281 0.015631 0.046732 0.014181
24 (592.533, 614.556] 436 0.199541 0.001590 87.0 349.0 0.001480 0.001620 0.648828 0.008530 0.025547 0.014181
25 (614.556, 636.578] 338 0.230769 0.001233 78.0 260.0 0.001326 0.001207 0.741511 0.031228 0.092683 0.014181
26 (636.578, 658.6] 270 0.255556 0.000985 69.0 201.0 0.001173 0.000933 0.814339 0.024786 0.072828 0.014181
27 (658.6, 680.622] 141 0.205674 0.000514 29.0 112.0 0.000493 0.000520 0.667128 0.049882 0.147211 0.014181
28 (680.622, 702.644] 101 0.277228 0.000368 28.0 73.0 0.000476 0.000339 0.877653 0.071554 0.210525 0.014181
29 (702.644, 724.667] 47 0.255319 0.000171 12.0 35.0 0.000204 0.000162 0.813647 0.021909 0.064007 0.014181
30 (724.667, 746.689] 28 0.214286 0.000102 6.0 22.0 0.000102 0.000102 0.692740 0.041033 0.120906 0.014181
31 (746.689, 768.711] 18 0.388889 0.000066 7.0 11.0 0.000119 0.000051 1.203403 0.174603 0.510663 0.014181
32 (768.711, 790.733] 11 0.272727 0.000040 3.0 8.0 0.000051 0.000037 0.864527 0.116162 0.338877 0.014181
33 (790.733, 812.756] 3 0.333333 0.000011 1.0 2.0 0.000017 0.000009 1.040928 0.060606 0.176401 0.014181
34 (812.756, 834.778] 4 0.500000 0.000015 2.0 2.0 0.000034 0.000009 1.539806 0.166667 0.498878 0.014181
35 (834.778, 856.8] 2 0.000000 0.000007 0.0 2.0 0.000000 0.000009 0.000000 0.500000 1.539806 0.014181
36 (856.8, 878.822] 1 0.000000 0.000004 0.0 1.0 0.000000 0.000005 0.000000 0.000000 0.000000 0.014181
37 (878.822, 900.844] 0 NaN 0.000000 NaN NaN NaN NaN NaN NaN NaN 0.014181
38 (900.844, 922.867] 0 NaN 0.000000 NaN NaN NaN NaN NaN NaN NaN 0.014181
39 (922.867, 944.889] 1 0.000000 0.000004 0.0 1.0 0.000000 0.000005 0.000000 NaN NaN 0.014181
40 (944.889, 966.911] 0 NaN 0.000000 NaN NaN NaN NaN NaN NaN NaN 0.014181
41 (966.911, 988.933] 0 NaN 0.000000 NaN NaN NaN NaN NaN NaN NaN 0.014181
42 (988.933, 1010.956] 0 NaN 0.000000 NaN NaN NaN NaN NaN NaN NaN 0.014181
43 (1010.956, 1032.978] 0 NaN 0.000000 NaN NaN NaN NaN NaN NaN NaN 0.014181
44 (1032.978, 1055.0] 1 0.000000 0.000004 0.0 1.0 0.000000 0.000005 0.000000 NaN NaN 0.014181
In [1250]:
plot_by_woe(df_temp, 90)
# We plot the weight of evidence values.
[WoE plot for the 45 fine classes of 'mths_since_earliest_cr_line']
In [1251]:
# We create the following categories:
# <= 120, # 121 - 200, # 201 - 260, # 261 - 320, # 321 - 400, # 401 - 600, # >= 601
df_inputs_prepr['mths_since_earliest_cr_line:<=120'] = np.where(df_inputs_prepr['mths_since_earliest_cr_line'].isin(range(121)), 1, 0)
df_inputs_prepr['mths_since_earliest_cr_line:121-200'] = np.where(df_inputs_prepr['mths_since_earliest_cr_line'].isin(range(121, 201)), 1, 0)
df_inputs_prepr['mths_since_earliest_cr_line:201-260'] = np.where(df_inputs_prepr['mths_since_earliest_cr_line'].isin(range(201,261)), 1, 0)
df_inputs_prepr['mths_since_earliest_cr_line:261-320'] = np.where(df_inputs_prepr['mths_since_earliest_cr_line'].isin(range(261, 321)), 1, 0)
df_inputs_prepr['mths_since_earliest_cr_line:321-400'] = np.where(df_inputs_prepr['mths_since_earliest_cr_line'].isin(range(321, 401)), 1, 0)
df_inputs_prepr['mths_since_earliest_cr_line:401-600'] = np.where(df_inputs_prepr['mths_since_earliest_cr_line'].isin(range(401, 601)), 1, 0)
df_inputs_prepr['mths_since_earliest_cr_line:>=601'] = np.where((df_inputs_prepr['mths_since_earliest_cr_line'] >= 601), 1, 0)
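One caveat with the `isin(range(...))` pattern used above: it only matches whole-number values, so if the months variable ever carried fractional values they would silently fall outside every class. A small illustration on made-up values:

```python
import numpy as np
import pandas as pd

months = pd.Series([60., 100.5, 120., 250.])

# Membership in range(121) only matches whole numbers, so 100.5 is missed:
via_range = np.where(months.isin(range(121)), 1, 0)

# An interval comparison also covers fractional values:
via_interval = np.where(months <= 120, 1, 0)
```

For integer-valued month counts the two forms agree; the interval form is simply the safer default.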
In [1252]:
df_inputs_prepr = df_inputs_prepr.drop(columns = ['mths_since_earliest_cr_line_factor'])
# Drop the temporary fine-classing feature

Variable: 'delinq_2yrs'¶

In [1253]:
df_inputs_prepr['delinq_2yrs'].unique()
Out[1253]:
array([ 0.,  1.,  2.,  3.,  4.,  5., 14.,  7.,  6.,  9., 10.,  8., 13.,
       11., 12., 18., 16., 15., 20., 17., 19., 36., 26., 27., 22., 24.])
In [1254]:
# delinq_2yrs
df_temp = woe_ordered_continuous(df_inputs_prepr, 'delinq_2yrs', df_targets_prepr)
# We calculate weight of evidence.
df_temp
Out[1254]:
delinq_2yrs n_obs prop_good prop_n_obs n_good n_bad prop_n_good prop_n_bad WoE diff_prop_good diff_WoE IV
0 0.0 221126 0.210726 0.806341 46597.0 174529.0 0.792439 0.810135 0.682165 NaN NaN inf
1 1.0 35193 0.223482 0.128332 7865.0 27328.0 0.133754 0.126852 0.719988 0.012756 0.037823 inf
2 2.0 10331 0.235989 0.037672 2438.0 7893.0 0.041461 0.036638 0.756893 0.012507 0.036905 inf
3 3.0 3763 0.257242 0.013722 968.0 2795.0 0.016462 0.012974 0.819275 0.021253 0.062381 inf
4 4.0 1651 0.245306 0.006020 405.0 1246.0 0.006888 0.005784 0.784287 0.011936 0.034988 inf
5 5.0 870 0.237931 0.003172 207.0 663.0 0.003520 0.003078 0.762610 0.007375 0.021677 inf
6 6.0 503 0.228628 0.001834 115.0 388.0 0.001956 0.001801 0.735194 0.009303 0.027417 inf
7 7.0 295 0.254237 0.001076 75.0 220.0 0.001275 0.001021 0.810478 0.025609 0.075285 inf
8 8.0 158 0.259494 0.000576 41.0 117.0 0.000697 0.000543 0.825865 0.005256 0.015387 inf
9 9.0 114 0.271930 0.000416 31.0 83.0 0.000527 0.000385 0.862200 0.012436 0.036335 inf
10 10.0 72 0.222222 0.000263 16.0 56.0 0.000272 0.000260 0.716262 0.049708 0.145938 inf
11 11.0 44 0.340909 0.000160 15.0 29.0 0.000255 0.000135 1.062988 0.118687 0.346727 inf
12 12.0 31 0.193548 0.000113 6.0 25.0 0.000102 0.000116 0.630891 0.147361 0.432097 inf
13 13.0 25 0.240000 0.000091 6.0 19.0 0.000102 0.000088 0.768697 0.046452 0.137806 inf
14 14.0 19 0.315789 0.000069 6.0 13.0 0.000102 0.000060 0.989887 0.075789 0.221191 inf
15 15.0 13 0.307692 0.000047 4.0 9.0 0.000068 0.000042 0.966339 0.008097 0.023548 inf
16 16.0 6 0.333333 0.000022 2.0 4.0 0.000034 0.000019 1.040928 0.025641 0.074589 inf
17 17.0 5 0.200000 0.000018 1.0 4.0 0.000017 0.000019 0.650199 0.133333 0.390729 inf
18 18.0 2 0.000000 0.000007 0.0 2.0 0.000000 0.000009 0.000000 0.200000 0.650199 inf
19 19.0 2 0.500000 0.000007 1.0 1.0 0.000017 0.000005 1.539806 0.500000 1.539806 inf
20 20.0 5 0.200000 0.000018 1.0 4.0 0.000017 0.000019 0.650199 0.300000 0.889607 inf
21 22.0 1 0.000000 0.000004 0.0 1.0 0.000000 0.000005 0.000000 0.200000 0.650199 inf
22 24.0 1 0.000000 0.000004 0.0 1.0 0.000000 0.000005 0.000000 0.000000 0.000000 inf
23 26.0 2 0.000000 0.000007 0.0 2.0 0.000000 0.000009 0.000000 0.000000 0.000000 inf
24 27.0 1 1.000000 0.000004 1.0 0.0 0.000017 0.000000 inf 1.000000 inf inf
25 36.0 1 1.000000 0.000004 1.0 0.0 0.000017 0.000000 inf 0.000000 NaN inf
In [1255]:
plot_by_woe(df_temp)
# We plot the weight of evidence values.
[WoE plot by 'delinq_2yrs' value]
In [1256]:
# Categories: 0, 1, 2-9, >=10
df_inputs_prepr['delinq_2yrs:0'] = np.where((df_inputs_prepr['delinq_2yrs'] == 0), 1, 0)
df_inputs_prepr['delinq_2yrs:1'] = np.where((df_inputs_prepr['delinq_2yrs'] == 1), 1, 0)
df_inputs_prepr['delinq_2yrs:2-9'] = np.where((df_inputs_prepr['delinq_2yrs'] >= 2) & (df_inputs_prepr['delinq_2yrs'] <= 9), 1, 0)
df_inputs_prepr['delinq_2yrs:>=10'] = np.where((df_inputs_prepr['delinq_2yrs'] >= 10), 1, 0)

Variable: 'inq_last_6mths'¶

In [1257]:
df_inputs_prepr['inq_last_6mths'].unique()
Out[1257]:
array([1., 2., 0., 3., 6., 4., 5., 7., 8.])
In [1258]:
# inq_last_6mths
df_temp = woe_ordered_continuous(df_inputs_prepr, 'inq_last_6mths', df_targets_prepr)
# We calculate weight of evidence.
df_temp
Out[1258]:
inq_last_6mths n_obs prop_good prop_n_obs n_good n_bad prop_n_good prop_n_bad WoE diff_prop_good diff_WoE IV
0 0.0 157016 0.195547 0.572562 30704.0 126312.0 0.522159 0.586320 0.636879 NaN NaN 0.010856
1 1.0 74667 0.227222 0.272275 16966.0 57701.0 0.288528 0.267839 0.731042 0.031675 0.094163 0.010856
2 2.0 27967 0.253477 0.101982 7089.0 20878.0 0.120557 0.096912 0.808252 0.026255 0.077210 0.010856
3 3.0 10517 0.275554 0.038350 2898.0 7619.0 0.049284 0.035366 0.872772 0.022077 0.064520 0.010856
4 4.0 2849 0.282906 0.010389 806.0 2043.0 0.013707 0.009483 0.894204 0.007352 0.021432 0.010856
5 5.0 1006 0.290258 0.003668 292.0 714.0 0.004966 0.003314 0.915616 0.007352 0.021412 0.010856
6 6.0 200 0.230000 0.000729 46.0 154.0 0.000782 0.000715 0.739242 0.060258 0.176374 0.010856
7 7.0 6 0.166667 0.000022 1.0 5.0 0.000017 0.000023 0.549702 0.063333 0.189540 0.010856
8 8.0 6 0.000000 0.000022 0.0 6.0 0.000000 0.000028 0.000000 0.166667 0.549702 0.010856
In [1259]:
plot_by_woe(df_temp)
# We plot the weight of evidence values.
[WoE plot by 'inq_last_6mths' value]
In [1260]:
# Categories: 0, 1 - 2, 3 - 5, >= 6
df_inputs_prepr['inq_last_6mths:0'] = np.where((df_inputs_prepr['inq_last_6mths'] == 0), 1, 0)
df_inputs_prepr['inq_last_6mths:1-2'] = np.where((df_inputs_prepr['inq_last_6mths'] >= 1) & (df_inputs_prepr['inq_last_6mths'] <= 2), 1, 0)
df_inputs_prepr['inq_last_6mths:3-5'] = np.where((df_inputs_prepr['inq_last_6mths'] >= 3) & (df_inputs_prepr['inq_last_6mths'] <= 5), 1, 0)
df_inputs_prepr['inq_last_6mths:>=6'] = np.where((df_inputs_prepr['inq_last_6mths'] >= 6), 1, 0)

Variable: 'collections_12_mths_ex_med'¶

In [1261]:
df_inputs_prepr['collections_12_mths_ex_med'].unique()
Out[1261]:
array([ 0.,  1.,  2.,  3.,  7.,  4.,  6.,  5., 14.,  9.])
In [1262]:
# collections_12_mths_ex_med
df_temp = woe_ordered_continuous(df_inputs_prepr, 'collections_12_mths_ex_med', df_targets_prepr)
# We calculate weight of evidence.
df_temp
Out[1262]:
collections_12_mths_ex_med n_obs prop_good prop_n_obs n_good n_bad prop_n_good prop_n_bad WoE diff_prop_good diff_WoE IV
0 0.0 269936 0.213421 0.984327 57610.0 212326.0 0.979729 0.985582 0.690173 NaN NaN 0.001155
1 1.0 3999 0.278570 0.014582 1114.0 2885.0 0.018945 0.013392 0.881566 0.065149 0.191393 0.001155
2 2.0 250 0.268000 0.000912 67.0 183.0 0.001139 0.000849 0.850727 0.010570 0.030838 0.001155
3 3.0 29 0.241379 0.000106 7.0 22.0 0.000119 0.000102 0.772752 0.026621 0.077975 0.001155
4 4.0 11 0.363636 0.000040 4.0 7.0 0.000068 0.000032 1.129314 0.122257 0.356562 0.001155
5 5.0 3 0.000000 0.000011 0.0 3.0 0.000000 0.000014 0.000000 0.363636 1.129314 0.001155
6 6.0 2 0.000000 0.000007 0.0 2.0 0.000000 0.000009 0.000000 0.000000 0.000000 0.001155
7 7.0 2 0.000000 0.000007 0.0 2.0 0.000000 0.000009 0.000000 0.000000 0.000000 0.001155
8 9.0 1 0.000000 0.000004 0.0 1.0 0.000000 0.000005 0.000000 0.000000 0.000000 0.001155
9 14.0 1 0.000000 0.000004 0.0 1.0 0.000000 0.000005 0.000000 0.000000 0.000000 0.001155
In [1263]:
plot_by_woe(df_temp, 90)
# We plot the weight of evidence values.
[Plot: WoE by category for 'collections_12_mths_ex_med']
In [1264]:
# Categories: '0', '1', '>=2'
df_inputs_prepr['collections_12_mths_ex_med:0'] = np.where((df_inputs_prepr['collections_12_mths_ex_med'] == 0), 1, 0)
df_inputs_prepr['collections_12_mths_ex_med:1'] = np.where((df_inputs_prepr['collections_12_mths_ex_med'] == 1), 1, 0)
df_inputs_prepr['collections_12_mths_ex_med:>=2'] = np.where((df_inputs_prepr['collections_12_mths_ex_med'] >= 2), 1, 0)
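For reference when reading the IV column: a common rule of thumb (exact cutoffs vary between sources) classifies Information Value into bands, and the IV of roughly 0.0012 in the table above places 'collections_12_mths_ex_med' in the weakest band:

```python
def iv_strength(iv):
    # Conventional rule-of-thumb bands for Information Value; the thresholds
    # are the widely quoted ones, not values stated in this notebook.
    if iv < 0.02:
        return 'not predictive'
    if iv < 0.10:
        return 'weak'
    if iv < 0.30:
        return 'medium'
    if iv < 0.50:
        return 'strong'
    return 'suspicious (check for leakage)'

band = iv_strength(0.001155)  # IV of 'collections_12_mths_ex_med' from the table above
```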

Variable: 'chargeoff_within_12_mths'¶

In [1265]:
df_inputs_prepr['chargeoff_within_12_mths'].unique()
Out[1265]:
array([0., 1., 3., 2., 4., 5., 6., 8., 7.])
In [1266]:
# chargeoff_within_12_mths
df_temp = woe_ordered_continuous(df_inputs_prepr, 'chargeoff_within_12_mths', df_targets_prepr)
# We calculate weight of evidence.
df_temp
Out[1266]:
chargeoff_within_12_mths n_obs prop_good prop_n_obs n_good n_bad prop_n_good prop_n_bad WoE diff_prop_good diff_WoE IV
0 0.0 271937 0.214256 0.991624 58264.0 213673.0 0.990851 0.991835 0.692651 NaN NaN inf
1 1.0 2080 0.236058 0.007585 491.0 1589.0 0.008350 0.007376 0.757096 0.021802 0.064445 inf
2 2.0 168 0.220238 0.000613 37.0 131.0 0.000629 0.000608 0.710388 0.015820 0.046708 inf
3 3.0 27 0.185185 0.000098 5.0 22.0 0.000085 0.000102 0.605766 0.035053 0.104622 inf
4 4.0 13 0.307692 0.000047 4.0 9.0 0.000068 0.000042 0.966339 0.122507 0.360573 inf
5 5.0 5 0.000000 0.000018 0.0 5.0 0.000000 0.000023 0.000000 0.307692 0.966339 inf
6 6.0 1 0.000000 0.000004 0.0 1.0 0.000000 0.000005 0.000000 0.000000 0.000000 inf
7 7.0 2 0.000000 0.000007 0.0 2.0 0.000000 0.000009 0.000000 0.000000 0.000000 inf
8 8.0 1 1.000000 0.000004 1.0 0.0 0.000017 0.000000 inf 1.000000 inf inf
In [1267]:
plot_by_woe(df_temp, 90)
# We plot the weight of evidence values.
[Plot: WoE by category for 'chargeoff_within_12_mths']
In [1268]:
# Categories: '0', '1', '>=2'
df_inputs_prepr['chargeoff_within_12_mths:0'] = np.where((df_inputs_prepr['chargeoff_within_12_mths'] == 0), 1, 0)
df_inputs_prepr['chargeoff_within_12_mths:1'] = np.where((df_inputs_prepr['chargeoff_within_12_mths'] == 1), 1, 0)
df_inputs_prepr['chargeoff_within_12_mths:>=2'] = np.where((df_inputs_prepr['chargeoff_within_12_mths'] >= 2), 1, 0)
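The `inf` entries in the WoE and IV columns above come from bins with zero goods or zero bads (e.g. the single observation with `chargeoff_within_12_mths == 8`, which has no bads). The standard definition is WoE = ln(dist_good / dist_bad), and a small continuity correction keeps it finite on empty bins; a sketch (the notebook's `woe_ordered_continuous` helper may differ in details):

```python
import numpy as np

def woe_iv(n_good, n_bad, eps=0.5):
    # WoE per bin = ln(share of all goods / share of all bads); adding a small
    # eps to every bin count keeps empty bins from producing +/- infinity.
    good = np.asarray(n_good, dtype=float) + eps
    bad = np.asarray(n_bad, dtype=float) + eps
    dist_good = good / good.sum()
    dist_bad = bad / bad.sum()
    woe = np.log(dist_good / dist_bad)
    iv = float(((dist_good - dist_bad) * woe).sum())
    return woe, iv

# Counts taken from the first rows of the table above; the last bin has zero bads.
woe, iv = woe_iv([58264, 491, 1], [213673, 1589, 0])
```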

Variable: 'total_acc'¶

In [1269]:
# total_acc
df_inputs_prepr['total_acc_factor'] = pd.cut(df_inputs_prepr['total_acc'], 58)
# Here we do fine-classing: using the 'cut' method, we split the variable into 58 categories by its values.
df_temp = woe_ordered_continuous(df_inputs_prepr, 'total_acc_factor', df_targets_prepr)
# We calculate weight of evidence.
df_temp
C:\Users\pc\AppData\Local\Temp\ipykernel_6728\1025164655.py:4: FutureWarning: The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning.
  df = pd.concat([df.groupby(df.columns.values[0], as_index = False)[df.columns.values[1]].count(),
C:\Users\pc\AppData\Local\Temp\ipykernel_6728\1025164655.py:5: FutureWarning: The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning.
  df.groupby(df.columns.values[0], as_index = False)[df.columns.values[1]].mean()], axis = 1)
Out[1269]:
total_acc_factor n_obs prop_good prop_n_obs n_good n_bad prop_n_good prop_n_bad WoE diff_prop_good diff_WoE IV
0 (1.856, 4.483] 1324 0.270393 0.004828 358.0 966.0 0.006088 0.004484 0.857713 NaN NaN inf
1 (4.483, 6.966] 3984 0.231928 0.014528 924.0 3060.0 0.015714 0.014204 0.744928 0.038465 0.112786 inf
2 (6.966, 9.448] 12089 0.231367 0.044083 2797.0 9292.0 0.047566 0.043132 0.743275 0.000560 0.001652 inf
3 (9.448, 11.931] 12378 0.224592 0.045137 2780.0 9598.0 0.047277 0.044552 0.723270 0.006775 0.020005 inf
4 (11.931, 14.414] 23709 0.229449 0.086455 5440.0 18269.0 0.092514 0.084802 0.737615 0.004857 0.014345 inf
5 (14.414, 16.897] 17954 0.223293 0.065470 4009.0 13945.0 0.068178 0.064730 0.719429 0.006156 0.018187 inf
6 (16.897, 19.379] 29452 0.216148 0.107397 6366.0 23086.0 0.108262 0.107161 0.698267 0.007145 0.021161 inf
7 (19.379, 21.862] 19656 0.212810 0.071676 4183.0 15473.0 0.071137 0.071823 0.688359 0.003338 0.009908 inf
8 (21.862, 24.345] 28961 0.211629 0.105607 6129.0 22832.0 0.104231 0.105982 0.684851 0.001181 0.003509 inf
9 (24.345, 26.828] 17727 0.207875 0.064642 3685.0 14042.0 0.062668 0.065181 0.673684 0.003754 0.011167 inf
10 (26.828, 29.31] 23717 0.204748 0.086485 4856.0 18861.0 0.082582 0.087550 0.664368 0.003127 0.009316 inf
11 (29.31, 31.793] 13501 0.205614 0.049232 2776.0 10725.0 0.047209 0.049784 0.666951 0.000867 0.002583 inf
12 (31.793, 34.276] 17189 0.202106 0.062680 3474.0 13715.0 0.059080 0.063663 0.656488 0.003508 0.010463 inf
13 (34.276, 36.759] 9448 0.205758 0.034452 1944.0 7504.0 0.033060 0.034832 0.667378 0.003652 0.010891 inf
14 (36.759, 39.241] 11538 0.209828 0.042074 2421.0 9117.0 0.041172 0.042320 0.679496 0.004071 0.012118 inf
15 (39.241, 41.724] 5976 0.214692 0.021792 1283.0 4693.0 0.021819 0.021784 0.693947 0.004864 0.014450 inf
16 (41.724, 44.207] 7226 0.199972 0.026350 1445.0 5781.0 0.024574 0.026834 0.650116 0.014720 0.043831 inf
17 (44.207, 46.69] 3663 0.204750 0.013357 750.0 2913.0 0.012755 0.013522 0.664375 0.004778 0.014259 inf
18 (46.69, 49.172] 4301 0.214369 0.015684 922.0 3379.0 0.015680 0.015685 0.692987 0.009619 0.028612 inf
19 (49.172, 51.655] 2232 0.216846 0.008139 484.0 1748.0 0.008231 0.008114 0.700336 0.002477 0.007349 inf
20 (51.655, 54.138] 2389 0.226036 0.008712 540.0 1849.0 0.009183 0.008583 0.727538 0.009190 0.027202 inf
21 (54.138, 56.621] 1191 0.205709 0.004343 245.0 946.0 0.004167 0.004391 0.667234 0.020327 0.060304 inf
22 (56.621, 59.103] 1380 0.217391 0.005032 300.0 1080.0 0.005102 0.005013 0.701953 0.011682 0.034719 inf
23 (59.103, 61.586] 720 0.227778 0.002625 164.0 556.0 0.002789 0.002581 0.732683 0.010386 0.030729 inf
24 (61.586, 64.069] 1054 0.183112 0.003843 193.0 861.0 0.003282 0.003997 0.599520 0.044666 0.133163 inf
25 (64.069, 66.552] 288 0.232639 0.001050 67.0 221.0 0.001139 0.001026 0.747024 0.049527 0.147504 inf
26 (66.552, 69.034] 331 0.211480 0.001207 70.0 261.0 0.001190 0.001212 0.684408 0.021159 0.062616 inf
27 (69.034, 71.517] 167 0.209581 0.000609 35.0 132.0 0.000595 0.000613 0.678760 0.001900 0.005648 inf
28 (71.517, 74.0] 183 0.234973 0.000667 43.0 140.0 0.000731 0.000650 0.753901 0.025392 0.075141 inf
29 (74.0, 76.483] 98 0.204082 0.000357 20.0 78.0 0.000340 0.000362 0.662382 0.030891 0.091519 inf
30 (76.483, 78.966] 72 0.208333 0.000263 15.0 57.0 0.000255 0.000265 0.675048 0.004252 0.012666 inf
31 (78.966, 81.448] 86 0.232558 0.000314 20.0 66.0 0.000340 0.000306 0.746786 0.024225 0.071738 inf
32 (81.448, 83.931] 47 0.255319 0.000171 12.0 35.0 0.000204 0.000162 0.813647 0.022761 0.066860 inf
33 (83.931, 86.414] 59 0.203390 0.000215 12.0 47.0 0.000204 0.000218 0.660319 0.051929 0.153328 inf
34 (86.414, 88.897] 26 0.307692 0.000095 8.0 18.0 0.000136 0.000084 0.966339 0.104302 0.306020 inf
35 (88.897, 91.379] 34 0.294118 0.000124 10.0 24.0 0.000170 0.000111 0.926849 0.013575 0.039490 inf
36 (91.379, 93.862] 12 0.250000 0.000044 3.0 9.0 0.000051 0.000042 0.798060 0.044118 0.128789 inf
37 (93.862, 96.345] 26 0.346154 0.000095 9.0 17.0 0.000153 0.000079 1.078273 0.096154 0.280212 inf
38 (96.345, 98.828] 5 0.400000 0.000018 2.0 3.0 0.000034 0.000014 1.236185 0.053846 0.157913 inf
39 (98.828, 101.31] 9 0.222222 0.000033 2.0 7.0 0.000034 0.000032 0.716262 0.177778 0.519924 inf
40 (101.31, 103.793] 3 0.000000 0.000011 0.0 3.0 0.000000 0.000014 0.000000 0.222222 0.716262 inf
41 (103.793, 106.276] 7 0.142857 0.000026 1.0 6.0 0.000017 0.000028 0.476616 0.142857 0.476616 inf
42 (106.276, 108.759] 1 0.000000 0.000004 0.0 1.0 0.000000 0.000005 0.000000 0.142857 0.476616 inf
43 (108.759, 111.241] 6 0.333333 0.000022 2.0 4.0 0.000034 0.000019 1.040928 0.333333 1.040928 inf
44 (111.241, 113.724] 3 0.000000 0.000011 0.0 3.0 0.000000 0.000014 0.000000 0.333333 1.040928 inf
45 (113.724, 116.207] 4 0.250000 0.000015 1.0 3.0 0.000017 0.000014 0.798060 0.250000 0.798060 inf
46 (116.207, 118.69] 0 NaN 0.000000 NaN NaN NaN NaN NaN NaN NaN inf
47 (118.69, 121.172] 1 1.000000 0.000004 1.0 0.0 0.000017 0.000000 inf NaN NaN inf
48 (121.172, 123.655] 2 0.000000 0.000007 0.0 2.0 0.000000 0.000009 0.000000 1.000000 inf inf
49 (123.655, 126.138] 1 0.000000 0.000004 0.0 1.0 0.000000 0.000005 0.000000 0.000000 0.000000 inf
50 (126.138, 128.621] 0 NaN 0.000000 NaN NaN NaN NaN NaN NaN NaN inf
51 (128.621, 131.103] 0 NaN 0.000000 NaN NaN NaN NaN NaN NaN NaN inf
52 (131.103, 133.586] 1 0.000000 0.000004 0.0 1.0 0.000000 0.000005 0.000000 NaN NaN inf
53 (133.586, 136.069] 0 NaN 0.000000 NaN NaN NaN NaN NaN NaN NaN inf
54 (136.069, 138.552] 1 1.000000 0.000004 1.0 0.0 0.000017 0.000000 inf NaN NaN inf
55 (138.552, 141.034] 1 0.000000 0.000004 0.0 1.0 0.000000 0.000005 0.000000 1.000000 inf inf
56 (141.034, 143.517] 0 NaN 0.000000 NaN NaN NaN NaN NaN NaN NaN inf
57 (143.517, 146.0] 1 0.000000 0.000004 0.0 1.0 0.000000 0.000005 0.000000 NaN NaN inf
In [1270]:
plot_by_woe(df_temp, 90)
# We plot the weight of evidence values.
[Plot: WoE by fine class for 'total_acc']
In [1271]:
# Categories: '<=20', '21-56', '>=57'
df_inputs_prepr['total_acc:<=20'] = np.where((df_inputs_prepr['total_acc'] <= 20), 1, 0)
df_inputs_prepr['total_acc:21-56'] = np.where((df_inputs_prepr['total_acc'] >= 21) & (df_inputs_prepr['total_acc'] <= 56), 1, 0)
df_inputs_prepr['total_acc:>=57'] = np.where((df_inputs_prepr['total_acc'] >= 57), 1, 0)
In [1272]:
df_inputs_prepr = df_inputs_prepr.drop(columns = ['total_acc_factor'])
# Drop the temporary fine-classing feature
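Note that equal-width fine-classing with `pd.cut` leaves the long right tail of 'total_acc' in nearly empty bins (many rows of the table above hold 0-3 observations). Quantile-based `pd.qcut` is a common alternative that keeps bin populations balanced; a sketch on synthetic skewed data, not the notebook's dataframe:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
s = pd.Series(rng.exponential(scale=20, size=10_000).round())

width_bins = pd.cut(s, 20)                      # equal-width: sparse tail bins
quant_bins = pd.qcut(s, 20, duplicates='drop')  # roughly equal-population bins

min_width = width_bins.value_counts().min()
min_quant = quant_bins.value_counts().min()
```

The trade-off: `qcut` bin edges are data-driven, so they are less interpretable than round-number cutoffs, which is why coarse classes are still set by hand afterwards.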

Variable: 'delinq_amnt'¶

In [ ]:
# unique values
df_inputs_prepr['delinq_amnt'].nunique()
In [ ]:
# number of observations with 0 value
df_inputs_prepr['delinq_amnt'].value_counts()[0]
In [ ]:
# 'delinq_amnt'
df_inputs_prepr['delinq_amnt_factor'] = pd.cut(df_inputs_prepr['delinq_amnt'], 50)
# Here we do fine-classing: using the 'cut' method, we split the variable into 50 categories by its values.

# delinq_amnt
df_temp = woe_ordered_continuous(df_inputs_prepr, 'delinq_amnt_factor', df_targets_prepr)
# We calculate weight of evidence.
df_temp
In [1276]:
plot_by_woe(df_temp, 90)
# We plot the weight of evidence values.
[Plot: WoE by fine class for 'delinq_amnt']
In [1277]:
# Categories: '0', '>=1'
df_inputs_prepr['delinq_amnt:0'] = np.where((df_inputs_prepr['delinq_amnt'] == 0), 1, 0)
df_inputs_prepr['delinq_amnt:>=1'] = np.where((df_inputs_prepr['delinq_amnt'] >= 1), 1, 0)
In [1278]:
df_inputs_prepr = df_inputs_prepr.drop(columns = ['delinq_amnt_factor'])
# Drop the temporary fine-classing feature
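'delinq_amnt' is dominated by a single value (hence the `value_counts()[0]` check above), so a binary '0' versus '>=1' split is about all the data supports. Whatever the grouping, the resulting dummies should partition the sample; a quick check on toy values (ours, not the notebook's data):

```python
import numpy as np
import pandas as pd

s = pd.Series([0, 0, 0, 1200, 0, 350])
flags = pd.DataFrame({
    'delinq_amnt:0':   np.where(s == 0, 1, 0),
    'delinq_amnt:>=1': np.where(s >= 1, 1, 0),
})
# Mutually exclusive and exhaustive: every row sits in exactly one bucket.
ok = (flags.sum(axis=1) == 1).all()
```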

Variable: 'num_accts_ever_120_pd'¶

In [1279]:
# unique values
df_inputs_prepr['num_accts_ever_120_pd'].unique()
Out[1279]:
array([ 0.,  2.,  1.,  3.,  4., 16.,  7.,  5., 18.,  9., 11.,  6., 13.,
       12., 23., 10.,  8., 15., 14., 26., 20., 34., 19., 17., 27., 24.,
       22., 28., 25., 29., 21., 30.])
In [1280]:
# 'num_accts_ever_120_pd'
df_temp = woe_ordered_continuous(df_inputs_prepr, 'num_accts_ever_120_pd', df_targets_prepr)

# We calculate weight of evidence.
df_temp
Out[1280]:
num_accts_ever_120_pd n_obs prop_good prop_n_obs n_good n_bad prop_n_good prop_n_bad WoE diff_prop_good diff_WoE IV
0 0.0 212322 0.209159 0.774237 44409.0 167913.0 0.755229 0.779425 0.677504 NaN NaN inf
1 1.0 32932 0.236335 0.120087 7783.0 25149.0 0.132359 0.116738 0.757914 0.027177 0.080410 inf
2 2.0 13319 0.228771 0.048568 3047.0 10272.0 0.051818 0.047681 0.735615 0.007565 0.022299 inf
3 3.0 6079 0.230137 0.022167 1399.0 4680.0 0.023792 0.021724 0.739645 0.001366 0.004030 inf
4 4.0 3578 0.229737 0.013047 822.0 2756.0 0.013979 0.012793 0.738467 0.000399 0.001178 inf
5 5.0 2173 0.218132 0.007924 474.0 1699.0 0.008061 0.007886 0.704148 0.011606 0.034319 inf
6 6.0 1368 0.230994 0.004988 316.0 1052.0 0.005374 0.004883 0.742175 0.012863 0.038027 inf
7 7.0 830 0.231325 0.003027 192.0 638.0 0.003265 0.002961 0.743151 0.000331 0.000977 inf
8 8.0 501 0.195609 0.001827 98.0 403.0 0.001667 0.001871 0.637064 0.035717 0.106087 inf
9 9.0 351 0.225071 0.001280 79.0 272.0 0.001343 0.001263 0.724687 0.029462 0.087623 inf
10 10.0 261 0.206897 0.000952 54.0 207.0 0.000918 0.000961 0.670771 0.018175 0.053916 inf
11 11.0 140 0.221429 0.000511 31.0 109.0 0.000527 0.000506 0.713913 0.014532 0.043142 inf
12 12.0 109 0.266055 0.000397 29.0 80.0 0.000493 0.000371 0.845046 0.044626 0.131134 inf
13 13.0 67 0.268657 0.000244 18.0 49.0 0.000306 0.000227 0.852645 0.002602 0.007599 inf
14 14.0 67 0.164179 0.000244 11.0 56.0 0.000187 0.000260 0.542122 0.104478 0.310523 inf
15 15.0 28 0.285714 0.000102 8.0 20.0 0.000136 0.000093 0.902384 0.121535 0.360262 inf
16 16.0 28 0.178571 0.000102 5.0 23.0 0.000085 0.000107 0.585814 0.107143 0.316570 inf
17 17.0 17 0.411765 0.000062 7.0 10.0 0.000119 0.000046 1.271046 0.233193 0.685232 inf
18 18.0 15 0.333333 0.000055 5.0 10.0 0.000085 0.000046 1.040928 0.078431 0.230119 inf
19 19.0 10 0.300000 0.000036 3.0 7.0 0.000051 0.000032 0.943965 0.033333 0.096963 inf
20 20.0 7 0.428571 0.000026 3.0 4.0 0.000051 0.000019 1.321159 0.128571 0.377195 inf
21 21.0 6 0.333333 0.000022 2.0 4.0 0.000034 0.000019 1.040928 0.095238 0.280232 inf
22 22.0 6 0.333333 0.000022 2.0 4.0 0.000034 0.000019 1.040928 0.000000 0.000000 inf
23 23.0 2 0.500000 0.000007 1.0 1.0 0.000017 0.000005 1.539806 0.166667 0.498878 inf
24 24.0 4 0.000000 0.000015 0.0 4.0 0.000000 0.000019 0.000000 0.500000 1.539806 inf
25 25.0 1 0.000000 0.000004 0.0 1.0 0.000000 0.000005 0.000000 0.000000 0.000000 inf
26 26.0 4 0.250000 0.000015 1.0 3.0 0.000017 0.000014 0.798060 0.250000 0.798060 inf
27 27.0 2 0.500000 0.000007 1.0 1.0 0.000017 0.000005 1.539806 0.250000 0.741746 inf
28 28.0 2 0.000000 0.000007 0.0 2.0 0.000000 0.000009 0.000000 0.500000 1.539806 inf
29 29.0 2 0.000000 0.000007 0.0 2.0 0.000000 0.000009 0.000000 0.000000 0.000000 inf
30 30.0 1 0.000000 0.000004 0.0 1.0 0.000000 0.000005 0.000000 0.000000 0.000000 inf
31 34.0 2 1.000000 0.000007 2.0 0.0 0.000034 0.000000 inf 1.000000 inf inf
In [1281]:
plot_by_woe(df_temp, 90)
# We plot the weight of evidence values.
[Plot: WoE by category for 'num_accts_ever_120_pd']
In [1282]:
# Categories: '0', '1-11', '>=12'
df_inputs_prepr['num_accts_ever_120_pd:0'] = np.where((df_inputs_prepr['num_accts_ever_120_pd'] == 0), 1, 0)
df_inputs_prepr['num_accts_ever_120_pd:1-11'] = np.where((df_inputs_prepr['num_accts_ever_120_pd'] >= 1) & (df_inputs_prepr['num_accts_ever_120_pd'] <= 11), 1, 0)
df_inputs_prepr['num_accts_ever_120_pd:>=12'] = np.where((df_inputs_prepr['num_accts_ever_120_pd'] >= 12), 1, 0)
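One heuristic behind groupings like '0', '1-11', '>=12' is to merge adjacent fine bins until each coarse bin reaches a minimum share of observations; a sketch (the 5% threshold is our choice, not a rule stated in the notebook):

```python
def merge_small_bins(counts, min_share=0.05):
    # Greedily accumulate adjacent bin counts until each merged bin holds at
    # least min_share of all observations; fold any leftover tail into the last bin.
    total = sum(counts)
    merged, acc = [], 0
    for c in counts:
        acc += c
        if acc >= min_share * total:
            merged.append(acc)
            acc = 0
    if acc:
        if merged:
            merged[-1] += acc
        else:
            merged.append(acc)
    return merged

# Observation counts for 'num_accts_ever_120_pd' values 0..12 from the table above.
coarse = merge_small_bins([212322, 32932, 13319, 6079, 3578, 2173, 1368,
                           830, 501, 351, 261, 140, 109])
```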

Variable: 'num_tl_90g_dpd_24m'¶

In [1283]:
# unique values
df_inputs_prepr['num_tl_90g_dpd_24m'].unique()
Out[1283]:
array([ 0.,  1.,  2.,  4., 13.,  3.,  9.,  6., 14.,  5.,  7.,  8., 11.,
       10., 12., 18., 15., 20., 16., 36., 26., 22., 24., 17.])
In [1284]:
# 'num_tl_90g_dpd_24m'
df_temp = woe_ordered_continuous(df_inputs_prepr, 'num_tl_90g_dpd_24m', df_targets_prepr)

# We calculate weight of evidence.
df_temp
Out[1284]:
num_tl_90g_dpd_24m n_obs prop_good prop_n_obs n_good n_bad prop_n_good prop_n_bad WoE diff_prop_good diff_WoE IV
0 0.0 259142 0.212849 0.944967 55158.0 203984.0 0.938029 0.946860 0.688473 NaN NaN inf
1 1.0 11266 0.242233 0.041082 2729.0 8537.0 0.046410 0.039627 0.775262 0.029385 0.086789 inf
2 2.0 2233 0.248992 0.008143 556.0 1677.0 0.009455 0.007784 0.795105 0.006759 0.019844 inf
3 3.0 623 0.240770 0.002272 150.0 473.0 0.002551 0.002196 0.770962 0.008222 0.024143 inf
4 4.0 349 0.217765 0.001273 76.0 273.0 0.001292 0.001267 0.703061 0.023005 0.067901 inf
5 5.0 199 0.190955 0.000726 38.0 161.0 0.000646 0.000747 0.623111 0.026810 0.079950 inf
6 6.0 147 0.170068 0.000536 25.0 122.0 0.000425 0.000566 0.560047 0.020887 0.063064 inf
7 7.0 71 0.281690 0.000259 20.0 51.0 0.000340 0.000237 0.890661 0.111622 0.330614 inf
8 8.0 45 0.266667 0.000164 12.0 33.0 0.000204 0.000153 0.846833 0.015023 0.043828 inf
9 9.0 53 0.169811 0.000193 9.0 44.0 0.000153 0.000204 0.559267 0.096855 0.287566 inf
10 10.0 29 0.241379 0.000106 7.0 22.0 0.000119 0.000102 0.772752 0.071568 0.213485 inf
11 11.0 15 0.333333 0.000055 5.0 10.0 0.000085 0.000046 1.040928 0.091954 0.268176 inf
12 12.0 17 0.176471 0.000062 3.0 14.0 0.000051 0.000065 0.579461 0.156863 0.461467 inf
13 13.0 14 0.214286 0.000051 3.0 11.0 0.000051 0.000051 0.692740 0.037815 0.113280 inf
14 14.0 12 0.500000 0.000044 6.0 6.0 0.000102 0.000028 1.539806 0.285714 0.847065 inf
15 15.0 5 0.200000 0.000018 1.0 4.0 0.000017 0.000019 0.650199 0.300000 0.889607 inf
16 16.0 3 0.333333 0.000011 1.0 2.0 0.000017 0.000009 1.040928 0.133333 0.390729 inf
17 17.0 1 0.000000 0.000004 0.0 1.0 0.000000 0.000005 0.000000 0.333333 1.040928 inf
18 18.0 2 0.000000 0.000007 0.0 2.0 0.000000 0.000009 0.000000 0.000000 0.000000 inf
19 20.0 3 0.333333 0.000011 1.0 2.0 0.000017 0.000009 1.040928 0.333333 1.040928 inf
20 22.0 1 0.000000 0.000004 0.0 1.0 0.000000 0.000005 0.000000 0.333333 1.040928 inf
21 24.0 1 0.000000 0.000004 0.0 1.0 0.000000 0.000005 0.000000 0.000000 0.000000 inf
22 26.0 2 0.500000 0.000007 1.0 1.0 0.000017 0.000005 1.539806 0.500000 1.539806 inf
23 36.0 1 1.000000 0.000004 1.0 0.0 0.000017 0.000000 inf 0.500000 inf inf
In [1285]:
plot_by_woe(df_temp, 90)
# We plot the weight of evidence values.
[Plot: WoE by category for 'num_tl_90g_dpd_24m']
In [1286]:
# Categories: '0', '1-4', '>=5'
df_inputs_prepr['num_tl_90g_dpd_24m:0'] = np.where((df_inputs_prepr['num_tl_90g_dpd_24m'] == 0), 1, 0)
df_inputs_prepr['num_tl_90g_dpd_24m:1-4'] = np.where((df_inputs_prepr['num_tl_90g_dpd_24m'] >= 1) & (df_inputs_prepr['num_tl_90g_dpd_24m'] <= 4), 1, 0)
df_inputs_prepr['num_tl_90g_dpd_24m:>=5'] = np.where((df_inputs_prepr['num_tl_90g_dpd_24m'] >= 5), 1, 0)
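Scorecard practice usually prefers coarse classes whose WoE moves monotonically with the underlying variable; the jagged WoE pattern across the sparse bins above is one reason to collapse them into '0', '1-4', '>=5'. A small check (ours, not a notebook helper):

```python
import numpy as np

def woe_is_monotonic(woe_values):
    # True when the finite WoE values are entirely non-decreasing or entirely
    # non-increasing; infinite values from empty bins are ignored.
    w = np.asarray(woe_values, dtype=float)
    w = w[np.isfinite(w)]
    d = np.diff(w)
    return bool((d >= 0).all() or (d <= 0).all())
```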

Variable: 'revol_bal'¶

In [1287]:
# unique values
df_inputs_prepr['revol_bal'].unique()
Out[1287]:
array([ 11405.,  30808.,  16940., ...,  87095., 155670.,  34577.])
In [1288]:
# 'revol_bal'
df_inputs_prepr['revol_bal_factor'] = pd.cut(df_inputs_prepr['revol_bal'], 50)
# Here we do fine-classing: using the 'cut' method, we split the variable into 50 categories by its values.

# 'revol_bal'
df_temp = woe_ordered_continuous(df_inputs_prepr, 'revol_bal_factor', df_targets_prepr)
# We calculate weight of evidence.
df_temp
C:\Users\pc\AppData\Local\Temp\ipykernel_6728\1025164655.py:4: FutureWarning: The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning.
  df = pd.concat([df.groupby(df.columns.values[0], as_index = False)[df.columns.values[1]].count(),
C:\Users\pc\AppData\Local\Temp\ipykernel_6728\1025164655.py:5: FutureWarning: The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning.
  df.groupby(df.columns.values[0], as_index = False)[df.columns.values[1]].mean()], axis = 1)
Out[1288]:
revol_bal_factor n_obs prop_good prop_n_obs n_good n_bad prop_n_good prop_n_bad WoE diff_prop_good diff_WoE IV
0 (-2904.836, 58096.72] 267276 0.215605 0.974628 57626.0 209650.0 0.980001 0.973161 0.696655 NaN NaN 0.001207
1 (58096.72, 116193.44] 5128 0.177457 0.018699 910.0 4218.0 0.015476 0.019579 0.582445 0.038148 0.114210 0.001207
2 (116193.44, 174290.16] 1030 0.151456 0.003756 156.0 874.0 0.002653 0.004057 0.503154 0.026001 0.079291 0.001207
3 (174290.16, 232386.88] 402 0.131841 0.001466 53.0 349.0 0.000901 0.001620 0.442360 0.019616 0.060794 0.001207
4 (232386.88, 290483.6] 198 0.111111 0.000722 22.0 176.0 0.000374 0.000817 0.377039 0.020730 0.065322 0.001207
5 (290483.6, 348580.32] 86 0.139535 0.000314 12.0 74.0 0.000204 0.000343 0.466316 0.028424 0.089278 0.001207
6 (348580.32, 406677.04] 45 0.177778 0.000164 8.0 37.0 0.000136 0.000172 0.583415 0.038243 0.117099 0.001207
7 (406677.04, 464773.76] 29 0.275862 0.000106 8.0 21.0 0.000136 0.000097 0.873671 0.098084 0.290256 0.001207
8 (464773.76, 522870.48] 10 0.300000 0.000036 3.0 7.0 0.000051 0.000032 0.943965 0.024138 0.070293 0.001207
9 (522870.48, 580967.2] 11 0.181818 0.000040 2.0 9.0 0.000034 0.000042 0.595618 0.118182 0.348346 0.001207
10 (580967.2, 639063.92] 6 0.000000 0.000022 0.0 6.0 0.000000 0.000028 0.000000 0.181818 0.595618 0.001207
11 (639063.92, 697160.64] 3 0.000000 0.000011 0.0 3.0 0.000000 0.000014 0.000000 0.000000 0.000000 0.001207
12 (697160.64, 755257.36] 2 0.000000 0.000007 0.0 2.0 0.000000 0.000009 0.000000 0.000000 0.000000 0.001207
13 (755257.36, 813354.08] 3 0.666667 0.000011 2.0 1.0 0.000034 0.000005 2.119548 0.666667 2.119548 0.001207
14 (813354.08, 871450.8] 1 0.000000 0.000004 0.0 1.0 0.000000 0.000005 0.000000 0.666667 2.119548 0.001207
15 (871450.8, 929547.52] 1 0.000000 0.000004 0.0 1.0 0.000000 0.000005 0.000000 0.000000 0.000000 0.001207
16 (929547.52, 987644.24] 0 NaN 0.000000 NaN NaN NaN NaN NaN NaN NaN 0.001207
17 (987644.24, 1045740.96] 0 NaN 0.000000 NaN NaN NaN NaN NaN NaN NaN 0.001207
18 (1045740.96, 1103837.68] 0 NaN 0.000000 NaN NaN NaN NaN NaN NaN NaN 0.001207
19 (1103837.68, 1161934.4] 0 NaN 0.000000 NaN NaN NaN NaN NaN NaN NaN 0.001207
20 (1161934.4, 1220031.12] 0 NaN 0.000000 NaN NaN NaN NaN NaN NaN NaN 0.001207
21 (1220031.12, 1278127.84] 0 NaN 0.000000 NaN NaN NaN NaN NaN NaN NaN 0.001207
22 (1278127.84, 1336224.56] 0 NaN 0.000000 NaN NaN NaN NaN NaN NaN NaN 0.001207
23 (1336224.56, 1394321.28] 0 NaN 0.000000 NaN NaN NaN NaN NaN NaN NaN 0.001207
24 (1394321.28, 1452418.0] 0 NaN 0.000000 NaN NaN NaN NaN NaN NaN NaN 0.001207
25 (1452418.0, 1510514.72] 0 NaN 0.000000 NaN NaN NaN NaN NaN NaN NaN 0.001207
26 (1510514.72, 1568611.44] 0 NaN 0.000000 NaN NaN NaN NaN NaN NaN NaN 0.001207
27 (1568611.44, 1626708.16] 0 NaN 0.000000 NaN NaN NaN NaN NaN NaN NaN 0.001207
28 (1626708.16, 1684804.88] 0 NaN 0.000000 NaN NaN NaN NaN NaN NaN NaN 0.001207
29 (1684804.88, 1742901.6] 0 NaN 0.000000 NaN NaN NaN NaN NaN NaN NaN 0.001207
30 (1742901.6, 1800998.32] 0 NaN 0.000000 NaN NaN NaN NaN NaN NaN NaN 0.001207
31 (1800998.32, 1859095.04] 0 NaN 0.000000 NaN NaN NaN NaN NaN NaN NaN 0.001207
32 (1859095.04, 1917191.76] 0 NaN 0.000000 NaN NaN NaN NaN NaN NaN NaN 0.001207
33 (1917191.76, 1975288.48] 0 NaN 0.000000 NaN NaN NaN NaN NaN NaN NaN 0.001207
34 (1975288.48, 2033385.2] 0 NaN 0.000000 NaN NaN NaN NaN NaN NaN NaN 0.001207
35 (2033385.2, 2091481.92] 0 NaN 0.000000 NaN NaN NaN NaN NaN NaN NaN 0.001207
36 (2091481.92, 2149578.64] 0 NaN 0.000000 NaN NaN NaN NaN NaN NaN NaN 0.001207
37 (2149578.64, 2207675.36] 0 NaN 0.000000 NaN NaN NaN NaN NaN NaN NaN 0.001207
38 (2207675.36, 2265772.08] 0 NaN 0.000000 NaN NaN NaN NaN NaN NaN NaN 0.001207
39 (2265772.08, 2323868.8] 0 NaN 0.000000 NaN NaN NaN NaN NaN NaN NaN 0.001207
40 (2323868.8, 2381965.52] 0 NaN 0.000000 NaN NaN NaN NaN NaN NaN NaN 0.001207
41 (2381965.52, 2440062.24] 0 NaN 0.000000 NaN NaN NaN NaN NaN NaN NaN 0.001207
42 (2440062.24, 2498158.96] 0 NaN 0.000000 NaN NaN NaN NaN NaN NaN NaN 0.001207
43 (2498158.96, 2556255.68] 0 NaN 0.000000 NaN NaN NaN NaN NaN NaN NaN 0.001207
44 (2556255.68, 2614352.4] 2 0.000000 0.000007 0.0 2.0 0.000000 0.000009 0.000000 NaN NaN 0.001207
45 (2614352.4, 2672449.12] 0 NaN 0.000000 NaN NaN NaN NaN NaN NaN NaN 0.001207
46 (2672449.12, 2730545.84] 0 NaN 0.000000 NaN NaN NaN NaN NaN NaN NaN 0.001207
47 (2730545.84, 2788642.56] 0 NaN 0.000000 NaN NaN NaN NaN NaN NaN NaN 0.001207
48 (2788642.56, 2846739.28] 0 NaN 0.000000 NaN NaN NaN NaN NaN NaN NaN 0.001207
49 (2846739.28, 2904836.0] 1 0.000000 0.000004 0.0 1.0 0.000000 0.000005 0.000000 NaN NaN 0.001207
In [1289]:
plot_by_woe(df_temp, 90)
# We plot the weight of evidence values.
[Plot: WoE by fine class for 'revol_bal']
In [1290]:
# One category will be created for 'revol_bal' > 100000.
# First, fine-class only the observations with 'revol_bal' <= 100000.
df_inputs_prepr_temp = df_inputs_prepr.loc[df_inputs_prepr['revol_bal'] <= 100000, : ].copy()
# We take a copy of the slice to avoid a SettingWithCopyWarning on the next assignment.

df_inputs_prepr_temp['revol_bal_factor'] = pd.cut(df_inputs_prepr_temp['revol_bal'], 50)
# Here we do fine-classing: using the 'cut' method, we split the variable into 50 categories by its values.

df_temp = woe_ordered_continuous(df_inputs_prepr_temp, 'revol_bal_factor', df_targets_prepr[df_inputs_prepr_temp.index])
# We calculate weight of evidence.
df_temp
C:\Users\pc\AppData\Local\Temp\ipykernel_6728\1025164655.py:4: FutureWarning: The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning.
  df = pd.concat([df.groupby(df.columns.values[0], as_index = False)[df.columns.values[1]].count(),
C:\Users\pc\AppData\Local\Temp\ipykernel_6728\1025164655.py:5: FutureWarning: The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning.
  df.groupby(df.columns.values[0], as_index = False)[df.columns.values[1]].mean()], axis = 1)
Out[1290]:
revol_bal_factor n_obs prop_good prop_n_obs n_good n_bad prop_n_good prop_n_bad WoE diff_prop_good diff_WoE IV
0 (-99.967, 1999.34] 16936 0.217820 0.062315 3689.0 13247.0 0.063122 0.062094 0.701395 NaN NaN 0.001701
1 (1999.34, 3998.68] 23741 0.212839 0.087354 5053.0 18688.0 0.086462 0.087598 0.686640 0.004981 0.014754 0.001701
2 (3998.68, 5998.02] 28965 0.211289 0.106575 6120.0 22845.0 0.104719 0.107084 0.682046 0.001549 0.004594 0.001701
3 (5998.02, 7997.36] 28485 0.215025 0.104809 6125.0 22360.0 0.104805 0.104810 0.693121 0.003736 0.011075 0.001701
4 (7997.36, 9996.7] 25966 0.219749 0.095541 5706.0 20260.0 0.097635 0.094967 0.707100 0.004723 0.013978 0.001701
5 (9996.7, 11996.04] 22836 0.222762 0.084024 5087.0 17749.0 0.087044 0.083197 0.716004 0.003013 0.008904 0.001701
6 (11996.04, 13995.38] 19075 0.220708 0.070185 4210.0 14865.0 0.072037 0.069678 0.709934 0.002055 0.006070 0.001701
7 (13995.38, 15994.72] 16227 0.225057 0.059706 3652.0 12575.0 0.062489 0.058944 0.722777 0.004349 0.012843 0.001701
8 (15994.72, 17994.06] 13461 0.224203 0.049529 3018.0 10443.0 0.051641 0.048950 0.720258 0.000854 0.002519 0.001701
9 (17994.06, 19993.4] 11389 0.219773 0.041905 2503.0 8886.0 0.042829 0.041652 0.707172 0.004430 0.013086 0.001701
10 (19993.4, 21992.74] 9359 0.213698 0.034436 2000.0 7359.0 0.034222 0.034495 0.689188 0.006075 0.017984 0.001701
11 (21992.74, 23992.08] 7979 0.212683 0.029358 1697.0 6282.0 0.029037 0.029446 0.686180 0.001015 0.003008 0.001701
12 (23992.08, 25991.42] 6614 0.217569 0.024336 1439.0 5175.0 0.024623 0.024257 0.700651 0.004885 0.014471 0.001701
13 (25991.42, 27990.76] 5734 0.206836 0.021098 1186.0 4548.0 0.020294 0.021318 0.668821 0.010732 0.031830 0.001701
14 (27990.76, 29990.1] 4853 0.208325 0.017856 1011.0 3842.0 0.017299 0.018009 0.673244 0.001488 0.004423 0.001701
15 (29990.1, 31989.44] 4215 0.207117 0.015509 873.0 3342.0 0.014938 0.015665 0.669657 0.001207 0.003588 0.001701
16 (31989.44, 33988.78] 3551 0.193467 0.013066 687.0 2864.0 0.011755 0.013425 0.628951 0.013651 0.040705 0.001701
17 (33988.78, 35988.12] 3121 0.210189 0.011484 656.0 2465.0 0.011225 0.011554 0.678780 0.016722 0.049829 0.001701
18 (35988.12, 37987.46] 2619 0.195494 0.009636 512.0 2107.0 0.008761 0.009876 0.635015 0.014695 0.043765 0.001701
19 (37987.46, 39986.8] 2186 0.208600 0.008043 456.0 1730.0 0.007803 0.008109 0.674062 0.013106 0.039047 0.001701
20 (39986.8, 41986.14] 1907 0.202412 0.007017 386.0 1521.0 0.006605 0.007130 0.655656 0.006188 0.018406 0.001701
21 (41986.14, 43985.48] 1626 0.178352 0.005983 290.0 1336.0 0.004962 0.006262 0.583546 0.024060 0.072110 0.001701
22 (43985.48, 45984.82] 1421 0.192118 0.005228 273.0 1148.0 0.004671 0.005381 0.624916 0.013766 0.041370 0.001701
23 (45984.82, 47984.16] 1257 0.171838 0.004625 216.0 1041.0 0.003696 0.004880 0.563856 0.020281 0.061059 0.001701
24 (47984.16, 49983.5] 1083 0.228994 0.003985 248.0 835.0 0.004244 0.003914 0.734384 0.057156 0.170528 0.001701
25 (49983.5, 51982.84] 815 0.218405 0.002999 178.0 637.0 0.003046 0.002986 0.703125 0.010589 0.031259 0.001701
26 (51982.84, 53982.18] 698 0.206304 0.002568 144.0 554.0 0.002464 0.002597 0.667238 0.012101 0.035887 0.001701
27 (53982.18, 55981.52] 608 0.177632 0.002237 108.0 500.0 0.001848 0.002344 0.581372 0.028672 0.085865 0.001701
28 (55981.52, 57980.86] 515 0.182524 0.001895 94.0 421.0 0.001608 0.001973 0.596118 0.004893 0.014745 0.001701
29 (57980.86, 59980.2] 455 0.184615 0.001674 84.0 371.0 0.001437 0.001739 0.602407 0.002091 0.006290 0.001701
30 (59980.2, 61979.54] 394 0.190355 0.001450 75.0 319.0 0.001283 0.001495 0.619635 0.005740 0.017228 0.001701
31 (61979.54, 63978.88] 379 0.195251 0.001395 74.0 305.0 0.001266 0.001430 0.634287 0.004895 0.014651 0.001701
32 (63978.88, 65978.22] 351 0.170940 0.001291 60.0 291.0 0.001027 0.001364 0.561137 0.024310 0.073149 0.001701
33 (65978.22, 67977.56] 296 0.199324 0.001089 59.0 237.0 0.001010 0.001111 0.646451 0.028384 0.085314 0.001701
34 (67977.56, 69976.9] 269 0.226766 0.000990 61.0 208.0 0.001044 0.000975 0.727817 0.027441 0.081366 0.001701
35 (69976.9, 71976.24] 261 0.176245 0.000960 46.0 215.0 0.000787 0.001008 0.577187 0.050521 0.150631 0.001701
36 (71976.24, 73975.58] 185 0.140541 0.000681 26.0 159.0 0.000445 0.000745 0.468080 0.035705 0.109107 0.001701
37 (73975.58, 75974.92] 251 0.187251 0.000924 47.0 204.0 0.000804 0.000956 0.610325 0.046710 0.142245 0.001701
38 (75974.92, 77974.26] 187 0.219251 0.000688 41.0 146.0 0.000702 0.000684 0.705628 0.032000 0.095304 0.001701
39 (77974.26, 79973.6] 187 0.155080 0.000688 29.0 158.0 0.000496 0.000741 0.512832 0.064171 0.192796 0.001701
40 (79973.6, 81972.94] 157 0.184713 0.000578 29.0 128.0 0.000496 0.000600 0.602702 0.029633 0.089870 0.001701
41 (81972.94, 83972.28] 183 0.213115 0.000673 39.0 144.0 0.000667 0.000675 0.687459 0.028401 0.084757 0.001701
42 (83972.28, 85971.62] 131 0.091603 0.000482 12.0 119.0 0.000205 0.000558 0.313430 0.121512 0.374029 0.001701
43 (85971.62, 87970.96] 172 0.174419 0.000633 30.0 142.0 0.000513 0.000666 0.571666 0.082816 0.258236 0.001701
44 (87970.96, 89970.3] 112 0.178571 0.000412 20.0 92.0 0.000342 0.000431 0.584208 0.004153 0.012542 0.001701
45 (89970.3, 91969.64] 131 0.213740 0.000482 28.0 103.0 0.000479 0.000483 0.689314 0.035169 0.105106 0.001701
46 (91969.64, 93968.98] 116 0.155172 0.000427 18.0 98.0 0.000308 0.000459 0.513114 0.058568 0.176199 0.001701
47 (93968.98, 95968.32] 105 0.171429 0.000386 18.0 87.0 0.000308 0.000408 0.562617 0.016256 0.049502 0.001701
48 (95968.32, 97967.66] 118 0.169492 0.000434 20.0 98.0 0.000342 0.000459 0.556746 0.001937 0.005871 0.001701
49 (97967.66, 99967.0] 98 0.091837 0.000361 9.0 89.0 0.000154 0.000417 0.314186 0.077655 0.242560 0.001701
In [1291]:
plot_by_woe(df_temp, 90)
# We plot the weight of evidence values.
[Figure: Weight of Evidence by 'revol_bal' fine-classed category]
In [1292]:
# Categories: '<= 8000', '8000-22000', '22000-35000', '35000-60000', '60000-100000', > 100000
df_inputs_prepr['revol_bal:<=8k'] = np.where((df_inputs_prepr['revol_bal'] <= 8000.), 1, 0)
df_inputs_prepr['revol_bal:8-22k'] = np.where((df_inputs_prepr['revol_bal'] > 8000.) & (df_inputs_prepr['revol_bal'] <= 22000.), 1, 0)
df_inputs_prepr['revol_bal:22-35k'] = np.where((df_inputs_prepr['revol_bal'] > 22000.) & (df_inputs_prepr['revol_bal'] <= 35000.), 1, 0)
df_inputs_prepr['revol_bal:35-60k'] = np.where((df_inputs_prepr['revol_bal'] > 35000.) & (df_inputs_prepr['revol_bal'] <= 60000.), 1, 0)
df_inputs_prepr['revol_bal:60-100k'] = np.where((df_inputs_prepr['revol_bal'] > 60000.) & (df_inputs_prepr['revol_bal'] <= 100000.), 1, 0)
df_inputs_prepr['revol_bal:>100k'] = np.where((df_inputs_prepr['revol_bal'] > 100000.), 1, 0)
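Assigning the dummy columns one at a time can fragment the DataFrame internally, which is why pandas may emit a PerformanceWarning for cells like the one above. A minimal sketch of the alternative: build all interval dummies in one pass and attach them with a single `pd.concat`. The `interval_dummies` helper is hypothetical, not part of this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical helper (illustration only): one 0/1 column per interval
# (lower, upper], plus an open-ended top bucket, returned as one DataFrame.
def interval_dummies(series, bounds, prefix):
    cols = {}
    lower = -np.inf
    for upper in bounds:
        cols[f'{prefix}:{lower}-{upper}'] = ((series > lower) & (series <= upper)).astype(int)
        lower = upper
    cols[f'{prefix}:>{lower}'] = (series > lower).astype(int)
    return pd.DataFrame(cols, index=series.index)

df = pd.DataFrame({'revol_bal': [5000., 15000., 30000., 120000.]})
dummies = interval_dummies(df['revol_bal'],
                           [8000., 22000., 35000., 60000., 100000.],
                           'revol_bal')
df = pd.concat([df, dummies], axis=1)  # one concat instead of six inserts
```

Because the columns are created in a dict and concatenated once, no intermediate `frame.insert` calls occur, so the fragmentation warning does not arise.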
In [1293]:
df_inputs_prepr = df_inputs_prepr.copy()
# De-fragment the DataFrame after the repeated column insertions above.
In [1294]:
df_inputs_prepr = df_inputs_prepr.drop(columns = ['revol_bal_factor'])
# Drop the provisional fine-classing feature.

Variable: 'total_bal_il'¶

In [1295]:
# unique values
df_inputs_prepr['total_bal_il'].unique()
Out[1295]:
array([     0.,  61045.,   7321., ...,  72775., 108273.,    919.])
In [1296]:
# number of observations with 0 value
df_inputs_prepr['total_bal_il'].value_counts()[0]
Out[1296]:
173265
In [1297]:
# 'total_bal_il'
df_inputs_prepr['total_bal_il_factor'] = pd.cut(df_inputs_prepr['total_bal_il'], 50)
# Here we do fine-classing: using the 'cut' method, we split the variable into 50 categories by its values.

# 'total_bal_il'
df_temp = woe_ordered_continuous(df_inputs_prepr, 'total_bal_il_factor', df_targets_prepr)
# We calculate weight of evidence.
df_temp
Out[1297]:
total_bal_il_factor n_obs prop_good prop_n_obs n_good n_bad prop_n_good prop_n_bad WoE diff_prop_good diff_WoE IV
0 (-1044.916, 20898.32] 212152 0.202082 0.773617 42872.0 169280.0 0.729091 0.785770 0.656415 NaN NaN 0.009611
1 (20898.32, 41796.64] 29386 0.261077 0.107157 7672.0 21714.0 0.130472 0.100793 0.830495 0.058995 0.174080 0.009611
2 (41796.64, 62694.96] 14531 0.259032 0.052988 3764.0 10767.0 0.064011 0.049979 0.824516 0.002044 0.005980 0.009611
3 (62694.96, 83593.28] 7465 0.255325 0.027221 1906.0 5559.0 0.032414 0.025804 0.813663 0.003708 0.010852 0.009611
4 (83593.28, 104491.6] 4013 0.238973 0.014633 959.0 3054.0 0.016309 0.014176 0.765677 0.016352 0.047986 0.009611
5 (104491.6, 125389.92] 2247 0.252781 0.008194 568.0 1679.0 0.009660 0.007794 0.806213 0.013808 0.040536 0.009611
6 (125389.92, 146288.24] 1410 0.248936 0.005142 351.0 1059.0 0.005969 0.004916 0.794940 0.003845 0.011273 0.009611
7 (146288.24, 167186.56] 930 0.252688 0.003391 235.0 695.0 0.003996 0.003226 0.805940 0.003752 0.011000 0.009611
8 (167186.56, 188084.88] 612 0.215686 0.002232 132.0 480.0 0.002245 0.002228 0.696897 0.037002 0.109043 0.009611
9 (188084.88, 208983.2] 411 0.243309 0.001499 100.0 311.0 0.001701 0.001444 0.778423 0.027623 0.081526 0.009611
10 (208983.2, 229881.52] 276 0.260870 0.001006 72.0 204.0 0.001224 0.000947 0.829889 0.017561 0.051467 0.009611
11 (229881.52, 250779.84] 179 0.268156 0.000653 48.0 131.0 0.000816 0.000608 0.851184 0.007287 0.021295 0.009611
12 (250779.84, 271678.16] 160 0.231250 0.000583 37.0 123.0 0.000629 0.000571 0.742929 0.036906 0.108255 0.009611
13 (271678.16, 292576.48] 100 0.180000 0.000365 18.0 82.0 0.000306 0.000381 0.590130 0.051250 0.152799 0.009611
14 (292576.48, 313474.8] 87 0.160920 0.000317 14.0 73.0 0.000238 0.000339 0.532171 0.019080 0.057959 0.009611
15 (313474.8, 334373.12] 54 0.129630 0.000197 7.0 47.0 0.000119 0.000218 0.435448 0.031290 0.096723 0.009611
16 (334373.12, 355271.44] 48 0.229167 0.000175 11.0 37.0 0.000187 0.000172 0.736783 0.099537 0.301335 0.009611
17 (355271.44, 376169.76] 33 0.212121 0.000120 7.0 26.0 0.000119 0.000121 0.686312 0.017045 0.050471 0.009611
18 (376169.76, 397068.08] 29 0.206897 0.000106 6.0 23.0 0.000102 0.000107 0.670771 0.005225 0.015542 0.009611
19 (397068.08, 417966.4] 15 0.200000 0.000055 3.0 12.0 0.000051 0.000056 0.650199 0.006897 0.020572 0.009611
20 (417966.4, 438864.72] 25 0.160000 0.000091 4.0 21.0 0.000068 0.000097 0.529360 0.040000 0.120839 0.009611
21 (438864.72, 459763.04] 11 0.272727 0.000040 3.0 8.0 0.000051 0.000037 0.864527 0.112727 0.335167 0.009611
22 (459763.04, 480661.36] 8 0.375000 0.000029 3.0 5.0 0.000051 0.000023 1.162592 0.102273 0.298065 0.009611
23 (480661.36, 501559.68] 10 0.100000 0.000036 1.0 9.0 0.000017 0.000042 0.341514 0.275000 0.821078 0.009611
24 (501559.68, 522458.0] 13 0.307692 0.000047 4.0 9.0 0.000068 0.000042 0.966339 0.207692 0.624825 0.009611
25 (522458.0, 543356.32] 7 0.428571 0.000026 3.0 4.0 0.000051 0.000019 1.321159 0.120879 0.354820 0.009611
26 (543356.32, 564254.64] 7 0.285714 0.000026 2.0 5.0 0.000034 0.000023 0.902384 0.142857 0.418775 0.009611
27 (564254.64, 585152.96] 3 0.000000 0.000011 0.0 3.0 0.000000 0.000014 0.000000 0.285714 0.902384 0.009611
28 (585152.96, 606051.28] 2 0.000000 0.000007 0.0 2.0 0.000000 0.000009 0.000000 0.000000 0.000000 0.009611
29 (606051.28, 626949.6] 2 0.000000 0.000007 0.0 2.0 0.000000 0.000009 0.000000 0.000000 0.000000 0.009611
30 (626949.6, 647847.92] 0 NaN 0.000000 NaN NaN NaN NaN NaN NaN NaN 0.009611
31 (647847.92, 668746.24] 0 NaN 0.000000 NaN NaN NaN NaN NaN NaN NaN 0.009611
32 (668746.24, 689644.56] 0 NaN 0.000000 NaN NaN NaN NaN NaN NaN NaN 0.009611
33 (689644.56, 710542.88] 2 0.000000 0.000007 0.0 2.0 0.000000 0.000009 0.000000 NaN NaN 0.009611
34 (710542.88, 731441.2] 2 0.000000 0.000007 0.0 2.0 0.000000 0.000009 0.000000 0.000000 0.000000 0.009611
35 (731441.2, 752339.52] 0 NaN 0.000000 NaN NaN NaN NaN NaN NaN NaN 0.009611
36 (752339.52, 773237.84] 0 NaN 0.000000 NaN NaN NaN NaN NaN NaN NaN 0.009611
37 (773237.84, 794136.16] 1 0.000000 0.000004 0.0 1.0 0.000000 0.000005 0.000000 NaN NaN 0.009611
38 (794136.16, 815034.48] 0 NaN 0.000000 NaN NaN NaN NaN NaN NaN NaN 0.009611
39 (815034.48, 835932.8] 0 NaN 0.000000 NaN NaN NaN NaN NaN NaN NaN 0.009611
40 (835932.8, 856831.12] 0 NaN 0.000000 NaN NaN NaN NaN NaN NaN NaN 0.009611
41 (856831.12, 877729.44] 0 NaN 0.000000 NaN NaN NaN NaN NaN NaN NaN 0.009611
42 (877729.44, 898627.76] 2 0.000000 0.000007 0.0 2.0 0.000000 0.000009 0.000000 NaN NaN 0.009611
43 (898627.76, 919526.08] 0 NaN 0.000000 NaN NaN NaN NaN NaN NaN NaN 0.009611
44 (919526.08, 940424.4] 0 NaN 0.000000 NaN NaN NaN NaN NaN NaN NaN 0.009611
45 (940424.4, 961322.72] 0 NaN 0.000000 NaN NaN NaN NaN NaN NaN NaN 0.009611
46 (961322.72, 982221.04] 0 NaN 0.000000 NaN NaN NaN NaN NaN NaN NaN 0.009611
47 (982221.04, 1003119.36] 0 NaN 0.000000 NaN NaN NaN NaN NaN NaN NaN 0.009611
48 (1003119.36, 1024017.68] 0 NaN 0.000000 NaN NaN NaN NaN NaN NaN NaN 0.009611
49 (1024017.68, 1044916.0] 1 0.000000 0.000004 0.0 1.0 0.000000 0.000005 0.000000 NaN NaN 0.009611
In [1298]:
plot_by_woe(df_temp, 90)
# We plot the weight of evidence values.
[Figure: Weight of Evidence by 'total_bal_il' fine-classed category]
In [1299]:
# One category will be created for 'total_bal_il' = 0 (173,265 observations).
# Another category will be created for 'total_bal_il' > 200000.
#***********************************************************************************************
# Fine-class the observations with 'total_bal_il' other than 0 and at most 200000.
df_inputs_prepr_temp = df_inputs_prepr.loc[(df_inputs_prepr['total_bal_il'] != 0) & (df_inputs_prepr['total_bal_il'] <= 200000), : ].copy()

#df_inputs_prepr_temp
df_inputs_prepr_temp['total_bal_il_factor'] = pd.cut(df_inputs_prepr_temp['total_bal_il'], 50)

# Here we do fine-classing: using the 'cut' method, we split the variable into 50 categories by its values.
df_temp = woe_ordered_continuous(df_inputs_prepr_temp, 'total_bal_il_factor', df_targets_prepr[df_inputs_prepr_temp.index])
# We calculate weight of evidence.
df_temp
Out[1299]:
total_bal_il_factor n_obs prop_good prop_n_obs n_good n_bad prop_n_good prop_n_bad WoE diff_prop_good diff_WoE IV
0 (-198.948, 3999.96] 5316 0.245109 0.053300 1303.0 4013.0 0.050998 0.054093 0.664122 NaN NaN 0.001671
1 (3999.96, 7998.92] 7607 0.245958 0.076271 1871.0 5736.0 0.073229 0.077318 0.666347 0.000849 0.002226 0.001671
2 (7998.92, 11997.88] 7994 0.260195 0.080151 2080.0 5914.0 0.081409 0.079717 0.703701 0.014237 0.037353 0.001671
3 (11997.88, 15996.84] 8308 0.258305 0.083299 2146.0 6162.0 0.083992 0.083060 0.698741 0.001890 0.004960 0.001671
4 (15996.84, 19995.8] 7995 0.259162 0.080161 2072.0 5923.0 0.081096 0.079839 0.700989 0.000857 0.002248 0.001671
5 (19995.8, 23994.76] 7196 0.259728 0.072150 1869.0 5327.0 0.073151 0.071805 0.702474 0.000566 0.001485 0.001671
6 (23994.76, 27993.72] 6462 0.250851 0.064790 1621.0 4841.0 0.063444 0.065254 0.679183 0.008876 0.023291 0.001671
7 (27993.72, 31992.68] 5914 0.263274 0.059296 1557.0 4357.0 0.060939 0.058730 0.711782 0.012422 0.032599 0.001671
8 (31992.68, 35991.64] 5061 0.265165 0.050743 1342.0 3719.0 0.052524 0.050130 0.716748 0.001891 0.004966 0.001671
9 (35991.64, 39990.6] 4527 0.261763 0.045389 1185.0 3342.0 0.046380 0.045048 0.707816 0.003402 0.008933 0.001671
10 (39990.6, 43989.56] 3971 0.270964 0.039815 1076.0 2895.0 0.042114 0.039023 0.731982 0.009202 0.024166 0.001671
11 (43989.56, 47988.52] 3394 0.266352 0.034029 904.0 2490.0 0.035382 0.033564 0.719866 0.004612 0.012115 0.001671
12 (47988.52, 51987.48] 2894 0.252937 0.029016 732.0 2162.0 0.028650 0.029143 0.684655 0.013415 0.035211 0.001671
13 (51987.48, 55986.44] 2548 0.263736 0.025547 672.0 1876.0 0.026301 0.025287 0.712997 0.010799 0.028342 0.001671
14 (55986.44, 59985.4] 2207 0.241504 0.022128 533.0 1674.0 0.020861 0.022565 0.654668 0.022232 0.058329 0.001671
15 (59985.4, 63984.36] 1995 0.260652 0.020003 520.0 1475.0 0.020352 0.019882 0.704899 0.019147 0.050231 0.001671
16 (63984.36, 67983.32] 1719 0.253054 0.017235 435.0 1284.0 0.017025 0.017308 0.684962 0.007598 0.019937 0.001671
17 (67983.32, 71982.28] 1582 0.256005 0.015862 405.0 1177.0 0.015851 0.015865 0.692705 0.002951 0.007743 0.001671
18 (71982.28, 75981.24] 1384 0.256503 0.013876 355.0 1029.0 0.013894 0.013870 0.694011 0.000498 0.001306 0.001671
19 (75981.24, 79980.2] 1206 0.252073 0.012092 304.0 902.0 0.011898 0.012158 0.682388 0.004430 0.011623 0.001671
20 (79980.2, 83979.16] 1094 0.269653 0.010969 295.0 799.0 0.011546 0.010770 0.728535 0.017580 0.046147 0.001671
21 (83979.16, 87978.12] 972 0.215021 0.009746 209.0 763.0 0.008180 0.010285 0.585200 0.054632 0.143335 0.001671
22 (87978.12, 91977.08] 857 0.252042 0.008593 216.0 641.0 0.008454 0.008640 0.682307 0.037021 0.097107 0.001671
23 (91977.08, 95976.04] 764 0.246073 0.007660 188.0 576.0 0.007358 0.007764 0.666651 0.005969 0.015656 0.001671
24 (95976.04, 99975.0] 661 0.210287 0.006627 139.0 522.0 0.005440 0.007036 0.572775 0.035786 0.093876 0.001671
25 (99975.0, 103973.96] 595 0.262185 0.005966 156.0 439.0 0.006106 0.005917 0.708924 0.051897 0.136149 0.001671
26 (103973.96, 107972.92] 507 0.230769 0.005083 117.0 390.0 0.004579 0.005257 0.626516 0.031416 0.082408 0.001671
27 (107972.92, 111971.88] 462 0.259740 0.004632 120.0 342.0 0.004697 0.004610 0.702507 0.028971 0.075991 0.001671
28 (111971.88, 115970.84] 460 0.252174 0.004612 116.0 344.0 0.004540 0.004637 0.682653 0.007566 0.019854 0.001671
29 (115970.84, 119969.8] 388 0.280928 0.003890 109.0 279.0 0.004266 0.003761 0.758177 0.028754 0.075524 0.001671
30 (119969.8, 123968.76] 363 0.250689 0.003640 91.0 272.0 0.003562 0.003666 0.678757 0.030239 0.079420 0.001671
31 (123968.76, 127967.72] 324 0.231481 0.003249 75.0 249.0 0.002935 0.003356 0.628384 0.019207 0.050373 0.001671
32 (127967.72, 131966.68] 320 0.250000 0.003208 80.0 240.0 0.003131 0.003235 0.676950 0.018519 0.048566 0.001671
33 (131966.68, 135965.64] 293 0.242321 0.002938 71.0 222.0 0.002779 0.002992 0.656809 0.007679 0.020141 0.001671
34 (135965.64, 139964.6] 241 0.286307 0.002416 69.0 172.0 0.002701 0.002318 0.772336 0.043986 0.115526 0.001671
35 (139964.6, 143963.56] 237 0.253165 0.002376 60.0 177.0 0.002348 0.002386 0.685252 0.033142 0.087084 0.001671
36 (143963.56, 147962.52] 217 0.253456 0.002176 55.0 162.0 0.002153 0.002184 0.686017 0.000292 0.000765 0.001671
37 (147962.52, 151961.48] 179 0.256983 0.001795 46.0 133.0 0.001800 0.001793 0.695271 0.003527 0.009254 0.001671
38 (151961.48, 155960.44] 189 0.243386 0.001895 46.0 143.0 0.001800 0.001928 0.659604 0.013597 0.035668 0.001671
39 (155960.44, 159959.4] 184 0.217391 0.001845 40.0 144.0 0.001566 0.001941 0.591422 0.025995 0.068181 0.001671
40 (159959.4, 163958.36] 169 0.254438 0.001694 43.0 126.0 0.001683 0.001698 0.688593 0.037047 0.097170 0.001671
41 (163958.36, 167957.32] 139 0.287770 0.001394 40.0 99.0 0.001566 0.001334 0.776188 0.033332 0.087595 0.001671
42 (167957.32, 171956.28] 128 0.195312 0.001283 25.0 103.0 0.000978 0.001388 0.533423 0.092457 0.242765 0.001671
43 (171956.28, 175955.24] 114 0.175439 0.001143 20.0 94.0 0.000783 0.001267 0.481059 0.019874 0.052363 0.001671
44 (175955.24, 179954.2] 124 0.217742 0.001243 27.0 97.0 0.001057 0.001308 0.592342 0.042303 0.111283 0.001671
45 (179954.2, 183953.16] 107 0.280374 0.001073 30.0 77.0 0.001174 0.001038 0.756719 0.062632 0.164377 0.001671
46 (183953.16, 187952.12] 110 0.200000 0.001103 22.0 88.0 0.000861 0.001186 0.545749 0.080374 0.210971 0.001671
47 (187952.12, 191951.08] 108 0.194444 0.001083 21.0 87.0 0.000822 0.001173 0.531139 0.005556 0.014609 0.001671
48 (191951.08, 195950.04] 77 0.311688 0.000772 24.0 53.0 0.000939 0.000714 0.839340 0.117244 0.308200 0.001671
49 (195950.04, 199949.0] 74 0.243243 0.000742 18.0 56.0 0.000705 0.000755 0.659229 0.068445 0.180111 0.001671
In [1300]:
plot_by_woe(df_temp, 90)
# We plot the weight of evidence values.
[Figure: Weight of Evidence by 'total_bal_il' fine-classed category (0 and > 200000 excluded)]
In [1301]:
# Categories: '= 0', '0 - 18000', '18000 - 30000', '30000 - 70000', '70000 - 200000', '> 200000'
df_inputs_prepr['total_bal_il:=0'] = np.where((df_inputs_prepr['total_bal_il'] == 0.), 1, 0)
df_inputs_prepr['total_bal_il:0-18k'] = np.where((df_inputs_prepr['total_bal_il'] > 0.) & (df_inputs_prepr['total_bal_il'] <= 18000.), 1, 0)
df_inputs_prepr['total_bal_il:18-30k'] = np.where((df_inputs_prepr['total_bal_il'] > 18000.) & (df_inputs_prepr['total_bal_il'] <= 30000.), 1, 0)
df_inputs_prepr['total_bal_il:30-70k'] = np.where((df_inputs_prepr['total_bal_il'] > 30000.) & (df_inputs_prepr['total_bal_il'] <= 70000.), 1, 0)
df_inputs_prepr['total_bal_il:70-200k'] = np.where((df_inputs_prepr['total_bal_il'] > 70000.) & (df_inputs_prepr['total_bal_il'] <= 200000.), 1, 0)
df_inputs_prepr['total_bal_il:>200k'] = np.where((df_inputs_prepr['total_bal_il'] > 200000.), 1, 0)
In [1302]:
df_inputs_prepr = df_inputs_prepr.drop(columns = ['total_bal_il_factor'])
# Drop the provisional fine-classing feature.
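For reference, the standard textbook definitions behind the tables in this section are WoE_i = ln(share of goods in bin i / share of bads in bin i) and IV = sum over bins of (share_good_i minus share_bad_i) times WoE_i. The sketch below is a minimal, self-contained illustration on toy data; it is not the notebook's own `woe_ordered_continuous` helper (defined earlier), whose exact normalisation may differ.

```python
import numpy as np
import pandas as pd

# Sketch of the standard WoE / IV formulas on a fine-classed variable.
def woe_iv(binned, good):
    tab = pd.crosstab(binned, good)        # rows: bins; columns: 0 (bad), 1 (good)
    dist_good = tab[1] / tab[1].sum()      # share of all goods falling in each bin
    dist_bad = tab[0] / tab[0].sum()       # share of all bads falling in each bin
    woe = np.log(dist_good / dist_bad)
    iv = ((dist_good - dist_bad) * woe).sum()
    return woe, iv

# Fabricated toy data: larger balances default more often here.
bal = pd.Series([1, 2, 3, 10, 11, 12, 20, 21, 22, 23])
bad = pd.Series([0, 0, 1, 0, 1, 0, 1, 1, 0, 1])   # 1 = default
good = 1 - bad
bins = pd.cut(bal, bins=[0, 5, 15, 25])            # fine-classing into 3 bins
woe, iv = woe_iv(bins, good)
```

Bins whose goods/bads mix matches the overall population get WoE near 0; bins dominated by goods get positive WoE, bins dominated by bads negative WoE, and IV summarises the variable's overall predictive strength.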

Variable: 'max_bal_bc'¶

In [1303]:
# unique values
df_inputs_prepr['max_bal_bc'].unique()
Out[1303]:
array([    0., 11140.,  3179., ..., 26142., 15475., 24378.])
In [1304]:
# number of observations with 0 value
df_inputs_prepr['max_bal_bc'].value_counts()[0]
Out[1304]:
164273
In [1305]:
# One category will be created for 'max_bal_bc' = 0 (164,273 observations).
# Another category will be created for 'max_bal_bc' > 50000.
#********************************
# 'max_bal_bc'
# Fine-class the observations with 'max_bal_bc' other than 0 and at most 50000.
df_inputs_prepr_temp = df_inputs_prepr.loc[(df_inputs_prepr['max_bal_bc'] != 0) & (df_inputs_prepr['max_bal_bc'] <= 50000), : ].copy()

#df_inputs_prepr_temp
df_inputs_prepr_temp['max_bal_bc_factor'] = pd.cut(df_inputs_prepr_temp['max_bal_bc'], 50)

# Here we do fine-classing: using the 'cut' method, we split the variable into 50 categories by its values.
df_temp = woe_ordered_continuous(df_inputs_prepr_temp, 'max_bal_bc_factor', df_targets_prepr[df_inputs_prepr_temp.index])
# We calculate weight of evidence.
df_temp
Out[1305]:
max_bal_bc_factor n_obs prop_good prop_n_obs n_good n_bad prop_n_good prop_n_bad WoE diff_prop_good diff_WoE IV
0 (-48.997, 1000.94] 9830 0.278535 0.089429 2738.0 7092.0 0.097109 0.086780 0.750959 NaN NaN 0.0056
1 (1000.94, 2000.88] 13143 0.278095 0.119570 3655.0 9488.0 0.129633 0.116098 0.749802 0.000440 0.001157 0.0056
2 (2000.88, 3000.82] 14661 0.267103 0.133380 3916.0 10745.0 0.138890 0.131479 0.720940 0.010992 0.028862 0.0056
3 (3000.82, 4000.76] 13546 0.262734 0.123236 3559.0 9987.0 0.126228 0.122204 0.709478 0.004369 0.011462 0.0056
4 (4000.76, 5000.7] 12530 0.259936 0.113993 3257.0 9273.0 0.115517 0.113467 0.702139 0.002798 0.007339 0.0056
5 (5000.7, 6000.64] 9542 0.262104 0.086809 2501.0 7041.0 0.088704 0.086156 0.707825 0.002168 0.005687 0.0056
6 (6000.64, 7000.58] 7337 0.243287 0.066749 1785.0 5552.0 0.063309 0.067936 0.658501 0.018817 0.049324 0.0056
7 (7000.58, 8000.52] 5866 0.247017 0.053367 1449.0 4417.0 0.051392 0.054048 0.668272 0.003729 0.009772 0.0056
8 (8000.52, 9000.46] 4461 0.239632 0.040584 1069.0 3392.0 0.037915 0.041506 0.648924 0.007384 0.019349 0.0056
9 (9000.46, 10000.4] 3969 0.244646 0.036108 971.0 2998.0 0.034439 0.036684 0.662060 0.005014 0.013136 0.0056
10 (10000.4, 11000.34] 2489 0.233427 0.022644 581.0 1908.0 0.020606 0.023347 0.632666 0.011219 0.029394 0.0056
11 (11000.34, 12000.28] 2023 0.235788 0.018404 477.0 1546.0 0.016918 0.018917 0.638853 0.002361 0.006187 0.0056
12 (12000.28, 13000.22] 1607 0.212819 0.014620 342.0 1265.0 0.012130 0.015479 0.578653 0.022970 0.060200 0.0056
13 (13000.22, 14000.16] 1419 0.208598 0.012910 296.0 1123.0 0.010498 0.013741 0.567580 0.004221 0.011073 0.0056
14 (14000.16, 15000.1] 1316 0.219605 0.011972 289.0 1027.0 0.010250 0.012567 0.596445 0.011007 0.028865 0.0056
15 (15000.1, 16000.04] 1047 0.225406 0.009525 236.0 811.0 0.008370 0.009924 0.611649 0.005801 0.015204 0.0056
16 (16000.04, 16999.98] 789 0.212928 0.007178 168.0 621.0 0.005959 0.007599 0.578938 0.012478 0.032711 0.0056
17 (16999.98, 17999.92] 775 0.216774 0.007051 168.0 607.0 0.005959 0.007427 0.589024 0.003846 0.010086 0.0056
18 (17999.92, 18999.86] 602 0.215947 0.005477 130.0 472.0 0.004611 0.005776 0.586855 0.000827 0.002169 0.0056
19 (18999.86, 19999.8] 591 0.204738 0.005377 121.0 470.0 0.004292 0.005751 0.557452 0.011209 0.029403 0.0056
20 (19999.8, 20999.74] 398 0.233668 0.003621 93.0 305.0 0.003298 0.003732 0.633298 0.028931 0.075847 0.0056
21 (20999.74, 21999.68] 298 0.167785 0.002711 50.0 248.0 0.001773 0.003035 0.460194 0.065883 0.173105 0.0056
22 (21999.68, 22999.62] 269 0.189591 0.002447 51.0 218.0 0.001809 0.002668 0.517660 0.021806 0.057466 0.0056
23 (22999.62, 23999.56] 268 0.205224 0.002438 55.0 213.0 0.001951 0.002606 0.558728 0.015633 0.041068 0.0056
24 (23999.56, 24999.5] 261 0.203065 0.002374 53.0 208.0 0.001880 0.002545 0.553061 0.002159 0.005666 0.0056
25 (24999.5, 25999.44] 147 0.210884 0.001337 31.0 116.0 0.001099 0.001419 0.573579 0.007819 0.020517 0.0056
26 (25999.44, 26999.38] 116 0.215517 0.001055 25.0 91.0 0.000887 0.001114 0.585728 0.004633 0.012150 0.0056
27 (26999.38, 27999.32] 94 0.138298 0.000855 13.0 81.0 0.000461 0.000991 0.381989 0.077219 0.203739 0.0056
28 (27999.32, 28999.26] 70 0.171429 0.000637 12.0 58.0 0.000426 0.000710 0.469813 0.033131 0.087824 0.0056
29 (28999.26, 29999.2] 78 0.282051 0.000710 22.0 56.0 0.000780 0.000685 0.760202 0.110623 0.290388 0.0056
30 (29999.2, 30999.14] 52 0.153846 0.000473 8.0 44.0 0.000284 0.000538 0.423308 0.128205 0.336893 0.0056
31 (30999.14, 31999.08] 37 0.216216 0.000337 8.0 29.0 0.000284 0.000355 0.587561 0.062370 0.164253 0.0056
32 (31999.08, 32999.02] 26 0.115385 0.000237 3.0 23.0 0.000106 0.000281 0.320683 0.100832 0.266878 0.0056
33 (32999.02, 33998.96] 30 0.366667 0.000273 11.0 19.0 0.000390 0.000232 0.985106 0.251282 0.664423 0.0056
34 (33998.96, 34998.9] 49 0.244898 0.000446 12.0 37.0 0.000426 0.000453 0.662721 0.121769 0.322385 0.0056
35 (34998.9, 35998.84] 24 0.166667 0.000218 4.0 20.0 0.000142 0.000245 0.457239 0.078231 0.205482 0.0056
36 (35998.84, 36998.78] 13 0.307692 0.000118 4.0 9.0 0.000142 0.000110 0.827781 0.141026 0.370542 0.0056
37 (36998.78, 37998.72] 27 0.222222 0.000246 6.0 21.0 0.000213 0.000257 0.603305 0.085470 0.224476 0.0056
38 (37998.72, 38998.66] 17 0.294118 0.000155 5.0 12.0 0.000177 0.000147 0.791960 0.071895 0.188655 0.0056
39 (38998.66, 39998.6] 15 0.133333 0.000136 2.0 13.0 0.000071 0.000159 0.368751 0.160784 0.423209 0.0056
40 (39998.6, 40998.54] 13 0.153846 0.000118 2.0 11.0 0.000071 0.000135 0.423308 0.020513 0.054557 0.0056
41 (40998.54, 41998.48] 3 0.000000 0.000027 0.0 3.0 0.000000 0.000037 0.000000 0.153846 0.423308 0.0056
42 (41998.48, 42998.42] 11 0.272727 0.000100 3.0 8.0 0.000106 0.000098 0.735703 0.272727 0.735703 0.0056
43 (42998.42, 43998.36] 9 0.000000 0.000082 0.0 9.0 0.000000 0.000110 0.000000 0.272727 0.735703 0.0056
44 (43998.36, 44998.3] 8 0.375000 0.000073 3.0 5.0 0.000106 0.000061 1.007636 0.375000 1.007636 0.0056
45 (44998.3, 45998.24] 8 0.125000 0.000073 1.0 7.0 0.000035 0.000086 0.346476 0.250000 0.661160 0.0056
46 (45998.24, 46998.18] 6 0.333333 0.000055 2.0 4.0 0.000071 0.000049 0.895788 0.208333 0.549312 0.0056
47 (46998.18, 47998.12] 10 0.300000 0.000091 3.0 7.0 0.000106 0.000086 0.807469 0.033333 0.088318 0.0056
48 (47998.12, 48998.06] 5 0.400000 0.000045 2.0 3.0 0.000071 0.000037 1.075805 0.100000 0.268336 0.0056
49 (48998.06, 49998.0] 14 0.214286 0.000127 3.0 11.0 0.000106 0.000135 0.582499 0.185714 0.493306 0.0056
In [1306]:
plot_by_woe(df_temp, 90)
# We plot the weight of evidence values.
[Figure: Weight of Evidence by 'max_bal_bc' fine-classed category (0 and > 50000 excluded)]
In [1307]:
# Categories: '= 0', '0 - 8000', '8000 - 16000', '16000 - 26000', '26000 - 50000', '> 50000'
df_inputs_prepr['max_bal_bc:=0'] = np.where((df_inputs_prepr['max_bal_bc'] == 0.), 1, 0)
df_inputs_prepr['max_bal_bc:0-8k'] = np.where((df_inputs_prepr['max_bal_bc'] > 0.) & (df_inputs_prepr['max_bal_bc'] <= 8000.), 1, 0)
df_inputs_prepr['max_bal_bc:8-16k'] = np.where((df_inputs_prepr['max_bal_bc'] > 8000.) & (df_inputs_prepr['max_bal_bc'] <= 16000.), 1, 0)
df_inputs_prepr['max_bal_bc:16-26k'] = np.where((df_inputs_prepr['max_bal_bc'] > 16000.) & (df_inputs_prepr['max_bal_bc'] <= 26000.), 1, 0)
df_inputs_prepr['max_bal_bc:26-50k'] = np.where((df_inputs_prepr['max_bal_bc'] > 26000.) & (df_inputs_prepr['max_bal_bc'] <= 50000.), 1, 0)
df_inputs_prepr['max_bal_bc:>50k'] = np.where((df_inputs_prepr['max_bal_bc'] > 50000.), 1, 0)
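As a design alternative, chained `np.where` assignments like the ones above can be expressed more compactly with `pd.cut` over explicit edges plus `pd.get_dummies`. A sketch, assuming the same bin labels are wanted:

```python
import pandas as pd

# One pd.cut over explicit edges replaces six np.where conditions;
# get_dummies then yields one 0/1 column per bin in a single step.
s = pd.Series([0., 5000., 12000., 30000., 70000.], name='max_bal_bc')
edges = [-float('inf'), 0., 8000., 16000., 26000., 50000., float('inf')]
labels = ['=0', '0-8k', '8-16k', '16-26k', '26-50k', '>50k']
dummies = pd.get_dummies(pd.cut(s, bins=edges, labels=labels),
                         prefix='max_bal_bc', prefix_sep=':').astype(int)
```

The edges must be half-open on the left to reproduce the `>` / `<=` boundaries used in the notebook, which `pd.cut` does by default (`right=True`).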

Variable: 'avg_cur_bal'¶

In [1308]:
# unique values
df_inputs_prepr['avg_cur_bal'].unique()
Out[1308]:
array([ 4658.,  7654.,  7645., ..., 58635., 49593., 61961.])
In [1309]:
# One category will be created for 'avg_cur_bal' > 100000.
#********************************
# 'avg_cur_bal'
# Fine-class the observations with 'avg_cur_bal' at most 100000.
df_inputs_prepr_temp = df_inputs_prepr.loc[df_inputs_prepr['avg_cur_bal'] <= 100000, : ].copy()

#df_inputs_prepr_temp
df_inputs_prepr_temp['avg_cur_bal_factor'] = pd.cut(df_inputs_prepr_temp['avg_cur_bal'], 50)

# Here we do fine-classing: using the 'cut' method, we split the variable into 50 categories by its values.
df_temp = woe_ordered_continuous(df_inputs_prepr_temp, 'avg_cur_bal_factor', df_targets_prepr[df_inputs_prepr_temp.index])
# We calculate weight of evidence.
df_temp
Out[1309]:
avg_cur_bal_factor n_obs prop_good prop_n_obs n_good n_bad prop_n_good prop_n_bad WoE diff_prop_good diff_WoE IV
0 (-99.959, 1999.18] 35692 0.246610 0.130586 8802.0 26890.0 0.149990 0.125281 0.787195 NaN NaN 0.022469
1 (1999.18, 3998.36] 50295 0.251735 0.184014 12661.0 37634.0 0.215749 0.175338 0.802214 0.005125 0.015019 0.022469
2 (3998.36, 5997.54] 30578 0.248970 0.111876 7613.0 22965.0 0.129729 0.106995 0.794114 0.002765 0.008101 0.022469
3 (5997.54, 7996.72] 31977 0.203771 0.116994 6516.0 25461.0 0.111035 0.118624 0.660640 0.045198 0.133473 0.022469
4 (7996.72, 9995.9] 13877 0.224616 0.050772 3117.0 10760.0 0.053115 0.050131 0.722473 0.020845 0.061833 0.022469
5 (9995.9, 11995.08] 12279 0.217037 0.044925 2665.0 9614.0 0.045413 0.044792 0.700053 0.007579 0.022420 0.022469
6 (11995.08, 13994.26] 11237 0.203613 0.041113 2288.0 8949.0 0.038988 0.041694 0.660168 0.013424 0.039885 0.022469
7 (13994.26, 15993.44] 10383 0.192237 0.037988 1996.0 8387.0 0.034013 0.039075 0.626174 0.011376 0.033995 0.022469
8 (15993.44, 17992.62] 9522 0.198173 0.034838 1887.0 7635.0 0.032155 0.035572 0.643934 0.005935 0.017761 0.022469
9 (17992.62, 19991.8] 8585 0.181712 0.031410 1560.0 7025.0 0.026583 0.032730 0.594542 0.016460 0.049393 0.022469
10 (19991.8, 21990.98] 7630 0.190170 0.027916 1451.0 6179.0 0.024726 0.028788 0.619976 0.008458 0.025434 0.022469
11 (21990.98, 23990.16] 6668 0.172016 0.024396 1147.0 5521.0 0.019545 0.025722 0.565231 0.018155 0.054745 0.022469
12 (23990.16, 25989.34] 5819 0.177006 0.021290 1030.0 4789.0 0.017552 0.022312 0.580338 0.004991 0.015107 0.022469
13 (25989.34, 27988.52] 5157 0.173744 0.018868 896.0 4261.0 0.015268 0.019852 0.570469 0.003262 0.009869 0.022469
14 (27988.52, 29987.7] 4416 0.162817 0.016157 719.0 3697.0 0.012252 0.017224 0.537264 0.010927 0.033205 0.022469
15 (29987.7, 31986.88] 3822 0.158817 0.013984 607.0 3215.0 0.010344 0.014979 0.525052 0.004000 0.012213 0.022469
16 (31986.88, 33986.06] 3328 0.156550 0.012176 521.0 2807.0 0.008878 0.013078 0.518115 0.002267 0.006937 0.022469
17 (33986.06, 35985.24] 2845 0.159930 0.010409 455.0 2390.0 0.007753 0.011135 0.528451 0.003379 0.010336 0.022469
18 (35985.24, 37984.42] 2445 0.161963 0.008946 396.0 2049.0 0.006748 0.009546 0.534660 0.002033 0.006209 0.022469
19 (37984.42, 39983.6] 2161 0.160574 0.007906 347.0 1814.0 0.005913 0.008451 0.530419 0.001389 0.004241 0.022469
20 (39983.6, 41982.78] 1800 0.162778 0.006586 293.0 1507.0 0.004993 0.007021 0.537145 0.002204 0.006726 0.022469
21 (41982.78, 43981.96] 1538 0.141743 0.005627 218.0 1320.0 0.003715 0.006150 0.472527 0.021035 0.064618 0.022469
22 (43981.96, 45981.14] 1417 0.143260 0.005184 203.0 1214.0 0.003459 0.005656 0.477223 0.001518 0.004696 0.022469
23 (45981.14, 47980.32] 1246 0.154896 0.004559 193.0 1053.0 0.003289 0.004906 0.513044 0.011635 0.035822 0.022469
24 (47980.32, 49979.5] 1020 0.139216 0.003732 142.0 878.0 0.002420 0.004091 0.464697 0.015680 0.048347 0.022469
25 (49979.5, 51978.68] 919 0.126224 0.003362 116.0 803.0 0.001977 0.003741 0.424193 0.012992 0.040504 0.022469
26 (51978.68, 53977.86] 781 0.134443 0.002857 105.0 676.0 0.001789 0.003150 0.449867 0.008219 0.025674 0.022469
27 (53977.86, 55977.04] 698 0.136103 0.002554 95.0 603.0 0.001619 0.002809 0.455032 0.001660 0.005165 0.022469
28 (55977.04, 57976.22] 594 0.134680 0.002173 80.0 514.0 0.001363 0.002395 0.450605 0.001423 0.004427 0.022469
29 (57976.22, 59975.4] 542 0.134686 0.001983 73.0 469.0 0.001244 0.002185 0.450624 0.000006 0.000019 0.022469
30 (59975.4, 61974.58] 499 0.128257 0.001826 64.0 435.0 0.001091 0.002027 0.430558 0.006430 0.020066 0.022469
31 (61974.58, 63973.76] 441 0.111111 0.001613 49.0 392.0 0.000835 0.001826 0.376509 0.017145 0.054049 0.022469
32 (63973.76, 65972.94] 367 0.128065 0.001343 47.0 320.0 0.000801 0.001491 0.429960 0.016954 0.053451 0.022469
33 (65972.94, 67972.12] 317 0.138801 0.001160 44.0 273.0 0.000750 0.001272 0.463412 0.010736 0.033452 0.022469
34 (67972.12, 69971.3] 281 0.135231 0.001028 38.0 243.0 0.000648 0.001132 0.452320 0.003570 0.011092 0.022469
35 (69971.3, 71970.48] 249 0.136546 0.000911 34.0 215.0 0.000579 0.001002 0.456409 0.001315 0.004089 0.022469
36 (71970.48, 73969.66] 247 0.097166 0.000904 24.0 223.0 0.000409 0.001039 0.331914 0.039380 0.124495 0.022469
37 (73969.66, 75968.84] 219 0.155251 0.000801 34.0 185.0 0.000579 0.000862 0.514134 0.058085 0.182220 0.022469
38 (75968.84, 77968.02] 190 0.110526 0.000695 21.0 169.0 0.000358 0.000787 0.374650 0.044725 0.139484 0.022469
39 (77968.02, 79967.2] 176 0.142045 0.000644 25.0 151.0 0.000426 0.000704 0.473465 0.031519 0.098814 0.022469
40 (79967.2, 81966.38] 180 0.111111 0.000659 20.0 160.0 0.000341 0.000745 0.376509 0.030934 0.096956 0.022469
41 (81966.38, 83965.56] 143 0.090909 0.000523 13.0 130.0 0.000222 0.000606 0.311704 0.020202 0.064805 0.022469
42 (83965.56, 85964.74] 114 0.096491 0.000417 11.0 103.0 0.000187 0.000480 0.329741 0.005582 0.018036 0.022469
43 (85964.74, 87963.92] 128 0.132812 0.000468 17.0 111.0 0.000290 0.000517 0.444787 0.036321 0.115047 0.022469
44 (87963.92, 89963.1] 109 0.110092 0.000399 12.0 97.0 0.000204 0.000452 0.373269 0.022721 0.071518 0.022469
45 (89963.1, 91962.28] 102 0.078431 0.000373 8.0 94.0 0.000136 0.000438 0.271001 0.031660 0.102267 0.022469
46 (91962.28, 93961.46] 92 0.065217 0.000337 6.0 86.0 0.000102 0.000401 0.227275 0.013214 0.043727 0.022469
47 (93961.46, 95960.64] 66 0.151515 0.000241 10.0 56.0 0.000170 0.000261 0.502668 0.086298 0.275393 0.022469
48 (95960.64, 97959.82] 66 0.136364 0.000241 9.0 57.0 0.000153 0.000266 0.455842 0.015152 0.046826 0.022469
49 (97959.82, 99959.0] 64 0.093750 0.000234 6.0 58.0 0.000102 0.000270 0.320896 0.042614 0.134946 0.022469
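The IV column repeats the variable's total Information Value in every row: it is the sum of the per-bin contributions, so 'avg_cur_bal' carries an IV of about 0.022. A conventional rule of thumb for reading such values (an industry convention, not something computed anywhere in this notebook) can be sketched as:

```python
def iv_strength(iv):
    """Classify an Information Value using conventional rule-of-thumb
    thresholds; these cut-offs are a common industry convention, not
    something derived from this dataset."""
    if iv < 0.02:
        return 'not predictive'
    elif iv < 0.1:
        return 'weak'
    elif iv < 0.3:
        return 'medium'
    elif iv < 0.5:
        return 'strong'
    return 'suspiciously strong (check for leakage)'

print(iv_strength(0.022469))  # avg_cur_bal's IV falls in the 'weak' band
```

By this convention, 'avg_cur_bal' (IV ≈ 0.022) is a weak but still usable predictor.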
In [1310]:
plot_by_woe(df_temp, 90)
# We plot the weight of evidence values.
[Plot: weight of evidence by 'avg_cur_bal' category]
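plot_by_woe is defined earlier in the notebook and its body is not shown in this section. A minimal sketch of what such a function typically does, given a WoE table like df_temp and a rotation angle for the x-axis labels (the function name, figure size, and styling here are assumptions):

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend for this sketch; drop in a live notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

def plot_by_woe_sketch(df_WoE, rotation_of_x_axis_labels=0):
    # Assumes the binned factor is in the first column and the WoE values
    # in a column named 'WoE', as in the tables produced above.
    x = np.array(df_WoE.iloc[:, 0].apply(str))
    y = df_WoE['WoE']
    plt.figure(figsize=(18, 6))
    plt.plot(x, y, marker='o', linestyle='--', color='k')
    plt.xlabel(df_WoE.columns[0])
    plt.ylabel('Weight of Evidence')
    plt.title('Weight of Evidence by ' + df_WoE.columns[0])
    plt.xticks(rotation=rotation_of_x_axis_labels)
```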
In [1311]:
# Categories: '< 7000', '7000 - 15000', '15000 - 30000', '30000 - 50000', '50000 - 100000', '> 100000'
df_inputs_prepr['avg_cur_bal:0-7k'] = np.where((df_inputs_prepr['avg_cur_bal'] >= 0.) & (df_inputs_prepr['avg_cur_bal'] <= 7000.), 1, 0)
df_inputs_prepr['avg_cur_bal:7-15k'] = np.where((df_inputs_prepr['avg_cur_bal'] > 7000.) & (df_inputs_prepr['avg_cur_bal'] <= 15000.), 1, 0)
df_inputs_prepr['avg_cur_bal:15-30k'] = np.where((df_inputs_prepr['avg_cur_bal'] > 15000.) & (df_inputs_prepr['avg_cur_bal'] <= 30000.), 1, 0)
df_inputs_prepr['avg_cur_bal:30-50k'] = np.where((df_inputs_prepr['avg_cur_bal'] > 30000.) & (df_inputs_prepr['avg_cur_bal'] <= 50000.), 1, 0)
df_inputs_prepr['avg_cur_bal:50-100k'] = np.where((df_inputs_prepr['avg_cur_bal'] > 50000.) & (df_inputs_prepr['avg_cur_bal'] <= 100000.), 1, 0)
df_inputs_prepr['avg_cur_bal:>100k'] = np.where((df_inputs_prepr['avg_cur_bal'] > 100000.), 1, 0)
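The six np.where calls above (coarse-classing) repeat the same pattern for every continuous variable in this section. A hypothetical helper — interval_dummies is not part of the notebook — that produces the same kind of 0/1 interval dummies from a list of cut points:

```python
import numpy as np
import pandas as pd

def interval_dummies(df, col, bounds, labels):
    # Build 0/1 dummies for the intervals (-inf, b0], (b0, b1], ..., (b_last, inf).
    # Unlike the explicit '>= 0' check above, the first band here extends to -inf;
    # for non-negative balances the result is identical.
    edges = [-np.inf] + list(bounds) + [np.inf]
    out = df.copy()
    for lo, hi, lab in zip(edges[:-1], edges[1:], labels):
        out[col + ':' + lab] = np.where((df[col] > lo) & (df[col] <= hi), 1, 0)
    return out
```

For example, interval_dummies(df_inputs_prepr, 'avg_cur_bal', [7000, 15000, 30000, 50000, 100000], ['0-7k', '7-15k', '15-30k', '30-50k', '50-100k', '>100k']) would reproduce the six columns created above.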

Variable: 'bc_open_to_buy'¶

In [1312]:
# unique values
df_inputs_prepr['bc_open_to_buy'].unique()
Out[1312]:
array([ 1221., 19625.,   207., ..., 37116., 37369., 32213.])
In [1313]:
# A separate category will be created for 'bc_open_to_buy' > 100000.
#********************************
# 'bc_open_to_buy'
# Keep only the rows with 'bc_open_to_buy' less than or equal to 100000.
# '.copy()' avoids the SettingWithCopyWarning when assigning the new column below.
df_inputs_prepr_temp = df_inputs_prepr.loc[df_inputs_prepr['bc_open_to_buy'] <= 100000, : ].copy()

# Here we do fine-classing: using the 'cut' method, we split the variable into 50 categories by its values.
df_inputs_prepr_temp['bc_open_to_buy_factor'] = pd.cut(df_inputs_prepr_temp['bc_open_to_buy'], 50)
df_temp = woe_ordered_continuous(df_inputs_prepr_temp, 'bc_open_to_buy_factor', df_targets_prepr[df_inputs_prepr_temp.index])
# We calculate weight of evidence.
df_temp
Out[1313]:
bc_open_to_buy_factor n_obs prop_good prop_n_obs n_good n_bad prop_n_good prop_n_bad WoE diff_prop_good diff_WoE IV
0 (-99.997, 1999.94] 79292 0.248865 0.290178 19733.0 59559.0 0.336115 0.277607 0.793336 NaN NaN 0.020203
1 (1999.94, 3999.88] 40816 0.237505 0.149371 9694.0 31122.0 0.165119 0.145061 0.759999 0.011360 0.033336 0.020203
2 (3999.88, 5999.82] 40135 0.210166 0.146879 8435.0 31700.0 0.143675 0.147755 0.679243 0.027339 0.080757 0.020203
3 (5999.82, 7999.76] 19726 0.215300 0.072190 4247.0 15479.0 0.072340 0.072148 0.694473 0.005134 0.015231 0.020203
4 (7999.76, 9999.7] 14927 0.202184 0.054627 3018.0 11909.0 0.051406 0.055508 0.655495 0.013116 0.038979 0.020203
5 (9999.7, 11999.64] 11763 0.206920 0.043048 2434.0 9329.0 0.041459 0.043483 0.669596 0.004736 0.014101 0.020203
6 (11999.64, 13999.58] 9316 0.197617 0.034093 1841.0 7475.0 0.031358 0.034841 0.641867 0.009303 0.027729 0.020203
7 (13999.58, 15999.52] 7759 0.185591 0.028395 1440.0 6319.0 0.024528 0.029453 0.605829 0.012026 0.036037 0.020203
8 (15999.52, 17999.46] 6360 0.183176 0.023275 1165.0 5195.0 0.019844 0.024214 0.598565 0.002415 0.007264 0.020203
9 (17999.46, 19999.4] 5465 0.185910 0.020000 1016.0 4449.0 0.017306 0.020737 0.606789 0.002734 0.008224 0.020203
10 (19999.4, 21999.34] 4618 0.172152 0.016900 795.0 3823.0 0.013541 0.017819 0.565275 0.013758 0.041514 0.020203
11 (21999.34, 23999.28] 3823 0.171593 0.013991 656.0 3167.0 0.011174 0.014762 0.563580 0.000559 0.001695 0.020203
12 (23999.28, 25999.22] 3527 0.176354 0.012907 622.0 2905.0 0.010595 0.013540 0.577988 0.004761 0.014409 0.020203
13 (25999.22, 27999.16] 2843 0.168484 0.010404 479.0 2364.0 0.008159 0.011019 0.554148 0.007870 0.023841 0.020203
14 (27999.16, 29999.1] 2488 0.162379 0.009105 404.0 2084.0 0.006881 0.009714 0.535573 0.006105 0.018574 0.020203
15 (29999.1, 31999.04] 2306 0.150043 0.008439 346.0 1960.0 0.005893 0.009136 0.497805 0.012336 0.037768 0.020203
16 (31999.04, 33998.98] 1986 0.146022 0.007268 290.0 1696.0 0.004940 0.007905 0.485423 0.004021 0.012383 0.020203
17 (33998.98, 35998.92] 1682 0.140309 0.006155 236.0 1446.0 0.004020 0.006740 0.467766 0.005713 0.017656 0.020203
18 (35998.92, 37998.86] 1521 0.132150 0.005566 201.0 1320.0 0.003424 0.006153 0.442414 0.008159 0.025352 0.020203
19 (37998.86, 39998.8] 1411 0.138909 0.005164 196.0 1215.0 0.003339 0.005663 0.463426 0.006759 0.021012 0.020203
20 (39998.8, 41998.74] 1213 0.136026 0.004439 165.0 1048.0 0.002810 0.004885 0.454479 0.002882 0.008947 0.020203
21 (41998.74, 43998.68] 1061 0.136664 0.003883 145.0 916.0 0.002470 0.004270 0.456459 0.000637 0.001980 0.020203
22 (43998.68, 45998.62] 948 0.165612 0.003469 157.0 791.0 0.002674 0.003687 0.545418 0.028948 0.088959 0.020203
23 (45998.62, 47998.56] 843 0.134045 0.003085 113.0 730.0 0.001925 0.003403 0.448317 0.031567 0.097100 0.020203
24 (47998.56, 49998.5] 726 0.119835 0.002657 87.0 639.0 0.001482 0.002978 0.403825 0.014210 0.044492 0.020203
25 (49998.5, 51998.44] 702 0.136752 0.002569 96.0 606.0 0.001635 0.002825 0.456734 0.016917 0.052909 0.020203
26 (51998.44, 53998.38] 608 0.139803 0.002225 85.0 523.0 0.001448 0.002438 0.466197 0.003050 0.009463 0.020203
27 (53998.38, 55998.32] 547 0.126143 0.002002 69.0 478.0 0.001175 0.002228 0.423641 0.013660 0.042557 0.020203
28 (55998.32, 57998.26] 521 0.099808 0.001907 52.0 469.0 0.000886 0.002186 0.340162 0.026335 0.083479 0.020203
29 (57998.26, 59998.2] 485 0.107216 0.001775 52.0 433.0 0.000886 0.002018 0.363852 0.007408 0.023690 0.020203
30 (59998.2, 61998.14] 404 0.150990 0.001478 61.0 343.0 0.001039 0.001599 0.500715 0.043774 0.136864 0.020203
31 (61998.14, 63998.08] 367 0.100817 0.001343 37.0 330.0 0.000630 0.001538 0.343399 0.050173 0.157316 0.020203
32 (63998.08, 65998.02] 319 0.119122 0.001167 38.0 281.0 0.000647 0.001310 0.401580 0.018305 0.058181 0.020203
33 (65998.02, 67997.96] 307 0.110749 0.001124 34.0 273.0 0.000579 0.001272 0.375090 0.008373 0.026491 0.020203
34 (67997.96, 69997.9] 256 0.117188 0.000937 30.0 226.0 0.000511 0.001053 0.395477 0.006438 0.020387 0.020203
35 (69997.9, 71997.84] 243 0.135802 0.000889 33.0 210.0 0.000562 0.000979 0.453783 0.018615 0.058306 0.020203
36 (71997.84, 73997.78] 246 0.126016 0.000900 31.0 215.0 0.000528 0.001002 0.423245 0.009786 0.030539 0.020203
37 (73997.78, 75997.72] 194 0.092784 0.000710 18.0 176.0 0.000307 0.000820 0.317538 0.033233 0.105707 0.020203
38 (75997.72, 77997.66] 205 0.097561 0.000750 20.0 185.0 0.000341 0.000862 0.332942 0.004777 0.015404 0.020203
39 (77997.66, 79997.6] 183 0.092896 0.000670 17.0 166.0 0.000290 0.000774 0.317902 0.004665 0.015040 0.020203
40 (79997.6, 81997.54] 161 0.136646 0.000589 22.0 139.0 0.000375 0.000648 0.456404 0.043750 0.138502 0.020203
41 (81997.54, 83997.48] 144 0.097222 0.000527 14.0 130.0 0.000238 0.000606 0.331852 0.039424 0.124552 0.020203
42 (83997.48, 85997.42] 140 0.128571 0.000512 18.0 122.0 0.000307 0.000569 0.431242 0.031349 0.099390 0.020203
43 (85997.42, 87997.36] 100 0.060000 0.000366 6.0 94.0 0.000102 0.000438 0.209659 0.068571 0.221583 0.020203
44 (87997.36, 89997.3] 130 0.123077 0.000476 16.0 114.0 0.000273 0.000531 0.414024 0.063077 0.204365 0.020203
45 (89997.3, 91997.24] 111 0.099099 0.000406 11.0 100.0 0.000187 0.000466 0.337885 0.023978 0.076138 0.020203
46 (91997.24, 93997.18] 95 0.105263 0.000348 10.0 85.0 0.000170 0.000396 0.357622 0.006164 0.019737 0.020203
47 (93997.18, 95997.12] 82 0.085366 0.000300 7.0 75.0 0.000119 0.000350 0.293471 0.019897 0.064151 0.020203
48 (95997.12, 97997.06] 73 0.095890 0.000267 7.0 66.0 0.000119 0.000308 0.327564 0.010525 0.034093 0.020203
49 (97997.06, 99997.0] 75 0.133333 0.000274 10.0 65.0 0.000170 0.000303 0.446101 0.037443 0.118537 0.020203
In [1314]:
plot_by_woe(df_temp, 90)
# We plot the weight of evidence values.
[Plot: weight of evidence by 'bc_open_to_buy' category]
In [1315]:
# Categories: '< 5000', '5000 - 15000', '15000 - 30000', '30000 - 45000', '45000 - 100000', '> 100000'
# (the dummy names below keep the rounded 30-50k / 50-100k labels)
df_inputs_prepr['bc_open_to_buy:0-5k'] = np.where((df_inputs_prepr['bc_open_to_buy'] >= 0.) & (df_inputs_prepr['bc_open_to_buy'] <= 5000.), 1, 0)
df_inputs_prepr['bc_open_to_buy:5-15k'] = np.where((df_inputs_prepr['bc_open_to_buy'] > 5000.) & (df_inputs_prepr['bc_open_to_buy'] <= 15000.), 1, 0)
df_inputs_prepr['bc_open_to_buy:15-30k'] = np.where((df_inputs_prepr['bc_open_to_buy'] > 15000.) & (df_inputs_prepr['bc_open_to_buy'] <= 30000.), 1, 0)
df_inputs_prepr['bc_open_to_buy:30-50k'] = np.where((df_inputs_prepr['bc_open_to_buy'] > 30000.) & (df_inputs_prepr['bc_open_to_buy'] <= 45000.), 1, 0)
df_inputs_prepr['bc_open_to_buy:50-100k'] = np.where((df_inputs_prepr['bc_open_to_buy'] > 45000.) & (df_inputs_prepr['bc_open_to_buy'] <= 100000.), 1, 0)
df_inputs_prepr['bc_open_to_buy:>100k'] = np.where((df_inputs_prepr['bc_open_to_buy'] > 100000.), 1, 0)
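woe_ordered_continuous is defined earlier in the notebook and not reproduced here. A minimal sketch of the textbook WoE/IV computation it is based on — the column names and the eps smoothing are assumptions, and the notebook's tables may follow a slightly different convention (e.g. for zero-count bins), so values need not match exactly. Without smoothing, a bin with no goods or no bads yields an infinite WoE and IV, which is what produces the inf entries seen in some of these tables:

```python
import numpy as np
import pandas as pd

def woe_sketch(binned, target, eps=1e-6):
    # `binned` is a fine-classed factor (e.g. the output of pd.cut),
    # `target` is 1 for good loans and 0 for bad ones.
    df = pd.DataFrame({'bin': binned, 'good': target})
    g = df.groupby('bin', observed=True)['good'].agg(['count', 'sum'])
    n_good = g['sum']
    n_bad = g['count'] - g['sum']
    # eps keeps WoE finite when a bin contains no goods or no bads.
    p_good = (n_good + eps) / n_good.sum()
    p_bad = (n_bad + eps) / n_bad.sum()
    g['WoE'] = np.log(p_good / p_bad)
    iv = ((p_good - p_bad) * g['WoE']).sum()
    return g, iv
```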

Variable: 'revol_bal_to_bc_limit'¶

In [1316]:
# number of unique values
df_inputs_prepr['revol_bal_to_bc_limit'].nunique()
Out[1316]:
238525
In [1317]:
# 'revol_bal_to_bc_limit'
df_inputs_prepr['revol_bal_to_bc_limit_factor'] = pd.cut(df_inputs_prepr['revol_bal_to_bc_limit'], 50)
# Here we do fine-classing: using the 'cut' method, we split the variable into 50 categories by its values.

# 'revol_bal_to_bc_limit'
df_temp = woe_ordered_continuous(df_inputs_prepr, 'revol_bal_to_bc_limit_factor', df_targets_prepr)
# We calculate weight of evidence.
df_temp
Out[1317]:
revol_bal_to_bc_limit_factor n_obs prop_good prop_n_obs n_good n_bad prop_n_good prop_n_bad WoE diff_prop_good diff_WoE IV
0 (-0.957, 19.135] 270872 0.214216 0.998231 58025.0 212847.0 0.998125 0.998260 0.693080 NaN NaN 0.00009
1 (19.135, 38.269] 354 0.225989 0.001305 80.0 274.0 0.001376 0.001285 0.727964 0.011773 0.034885 0.00009
2 (38.269, 57.404] 65 0.292308 0.000240 19.0 46.0 0.000327 0.000216 0.922241 0.066319 0.194276 0.00009
3 (57.404, 76.538] 25 0.240000 0.000092 6.0 19.0 0.000103 0.000089 0.769284 0.052308 0.152957 0.00009
4 (76.538, 95.673] 13 0.153846 0.000048 2.0 11.0 0.000034 0.000052 0.510938 0.086154 0.258346 0.00009
5 (95.673, 114.807] 8 0.125000 0.000029 1.0 7.0 0.000017 0.000033 0.421310 0.028846 0.089628 0.00009
6 (114.807, 133.942] 2 0.000000 0.000007 0.0 2.0 0.000000 0.000009 0.000000 0.125000 0.421310 0.00009
7 (133.942, 153.076] 3 0.000000 0.000011 0.0 3.0 0.000000 0.000014 0.000000 0.000000 0.000000 0.00009
8 (153.076, 172.211] 2 0.500000 0.000007 1.0 1.0 0.000017 0.000005 1.540666 0.500000 1.540666 0.00009
9 (172.211, 191.345] 1 0.000000 0.000004 0.0 1.0 0.000000 0.000005 0.000000 0.500000 1.540666 0.00009
10 (191.345, 210.48] 1 0.000000 0.000004 0.0 1.0 0.000000 0.000005 0.000000 0.000000 0.000000 0.00009
11 (210.48, 229.614] 1 0.000000 0.000004 0.0 1.0 0.000000 0.000005 0.000000 0.000000 0.000000 0.00009
12 (229.614, 248.749] 1 0.000000 0.000004 0.0 1.0 0.000000 0.000005 0.000000 0.000000 0.000000 0.00009
13 (248.749, 267.883] 0 NaN 0.000000 NaN NaN NaN NaN NaN NaN NaN 0.00009
14 (267.883, 287.018] 0 NaN 0.000000 NaN NaN NaN NaN NaN NaN NaN 0.00009
15 (287.018, 306.153] 1 0.000000 0.000004 0.0 1.0 0.000000 0.000005 0.000000 NaN NaN 0.00009
16 (306.153, 325.287] 0 NaN 0.000000 NaN NaN NaN NaN NaN NaN NaN 0.00009
17 (325.287, 344.422] 0 NaN 0.000000 NaN NaN NaN NaN NaN NaN NaN 0.00009
18 (344.422, 363.556] 0 NaN 0.000000 NaN NaN NaN NaN NaN NaN NaN 0.00009
19 (363.556, 382.691] 0 NaN 0.000000 NaN NaN NaN NaN NaN NaN NaN 0.00009
20 (382.691, 401.825] 0 NaN 0.000000 NaN NaN NaN NaN NaN NaN NaN 0.00009
21 (401.825, 420.96] 0 NaN 0.000000 NaN NaN NaN NaN NaN NaN NaN 0.00009
22 (420.96, 440.094] 1 0.000000 0.000004 0.0 1.0 0.000000 0.000005 0.000000 NaN NaN 0.00009
23 (440.094, 459.229] 0 NaN 0.000000 NaN NaN NaN NaN NaN NaN NaN 0.00009
24 (459.229, 478.363] 0 NaN 0.000000 NaN NaN NaN NaN NaN NaN NaN 0.00009
25 (478.363, 497.498] 0 NaN 0.000000 NaN NaN NaN NaN NaN NaN NaN 0.00009
26 (497.498, 516.632] 0 NaN 0.000000 NaN NaN NaN NaN NaN NaN NaN 0.00009
27 (516.632, 535.767] 0 NaN 0.000000 NaN NaN NaN NaN NaN NaN NaN 0.00009
28 (535.767, 554.901] 0 NaN 0.000000 NaN NaN NaN NaN NaN NaN NaN 0.00009
29 (554.901, 574.036] 0 NaN 0.000000 NaN NaN NaN NaN NaN NaN NaN 0.00009
30 (574.036, 593.171] 0 NaN 0.000000 NaN NaN NaN NaN NaN NaN NaN 0.00009
31 (593.171, 612.305] 0 NaN 0.000000 NaN NaN NaN NaN NaN NaN NaN 0.00009
32 (612.305, 631.44] 0 NaN 0.000000 NaN NaN NaN NaN NaN NaN NaN 0.00009
33 (631.44, 650.574] 0 NaN 0.000000 NaN NaN NaN NaN NaN NaN NaN 0.00009
34 (650.574, 669.709] 0 NaN 0.000000 NaN NaN NaN NaN NaN NaN NaN 0.00009
35 (669.709, 688.843] 1 0.000000 0.000004 0.0 1.0 0.000000 0.000005 0.000000 NaN NaN 0.00009
36 (688.843, 707.978] 0 NaN 0.000000 NaN NaN NaN NaN NaN NaN NaN 0.00009
37 (707.978, 727.112] 0 NaN 0.000000 NaN NaN NaN NaN NaN NaN NaN 0.00009
38 (727.112, 746.247] 0 NaN 0.000000 NaN NaN NaN NaN NaN NaN NaN 0.00009
39 (746.247, 765.381] 0 NaN 0.000000 NaN NaN NaN NaN NaN NaN NaN 0.00009
40 (765.381, 784.516] 0 NaN 0.000000 NaN NaN NaN NaN NaN NaN NaN 0.00009
41 (784.516, 803.65] 0 NaN 0.000000 NaN NaN NaN NaN NaN NaN NaN 0.00009
42 (803.65, 822.785] 0 NaN 0.000000 NaN NaN NaN NaN NaN NaN NaN 0.00009
43 (822.785, 841.919] 0 NaN 0.000000 NaN NaN NaN NaN NaN NaN NaN 0.00009
44 (841.919, 861.054] 0 NaN 0.000000 NaN NaN NaN NaN NaN NaN NaN 0.00009
45 (861.054, 880.189] 0 NaN 0.000000 NaN NaN NaN NaN NaN NaN NaN 0.00009
46 (880.189, 899.323] 0 NaN 0.000000 NaN NaN NaN NaN NaN NaN NaN 0.00009
47 (899.323, 918.458] 0 NaN 0.000000 NaN NaN NaN NaN NaN NaN NaN 0.00009
48 (918.458, 937.592] 0 NaN 0.000000 NaN NaN NaN NaN NaN NaN NaN 0.00009
49 (937.592, 956.727] 1 0.000000 0.000004 0.0 1.0 0.000000 0.000005 0.000000 NaN NaN 0.00009
In [1318]:
# A separate category will be created for 'revol_bal_to_bc_limit' > 10.
#********************************
# 'revol_bal_to_bc_limit'
# Keep only the rows with 'revol_bal_to_bc_limit' less than or equal to 10.
# '.copy()' avoids the SettingWithCopyWarning when assigning the new column below.
df_inputs_prepr_temp = df_inputs_prepr.loc[df_inputs_prepr['revol_bal_to_bc_limit'] <= 10, : ].copy()

# Here we do fine-classing: using the 'cut' method, we split the variable into 50 categories by its values.
df_inputs_prepr_temp['revol_bal_to_bc_limit_factor'] = pd.cut(df_inputs_prepr_temp['revol_bal_to_bc_limit'], 50)
df_temp = woe_ordered_continuous(df_inputs_prepr_temp, 'revol_bal_to_bc_limit_factor', df_targets_prepr[df_inputs_prepr_temp.index])
# We calculate weight of evidence.
df_temp
Out[1318]:
revol_bal_to_bc_limit_factor n_obs prop_good prop_n_obs n_good n_bad prop_n_good prop_n_bad WoE diff_prop_good diff_WoE IV
0 (-0.01, 0.2] 20300 0.158276 0.075207 3213.0 17087.0 0.055614 0.080543 0.525020 NaN NaN 0.0144
1 (0.2, 0.4] 30809 0.173910 0.114141 5358.0 25451.0 0.092742 0.119968 0.572706 0.015634 0.047686 0.0144
2 (0.4, 0.6] 40014 0.197531 0.148243 7904.0 32110.0 0.136811 0.151357 0.643905 0.023621 0.071199 0.0144
3 (0.6, 0.8] 44351 0.212126 0.164311 9408.0 34943.0 0.162844 0.164710 0.687466 0.014595 0.043561 0.0144
4 (0.8, 1.0] 48756 0.224342 0.180631 10938.0 37818.0 0.189327 0.178262 0.723711 0.012216 0.036245 0.0144
5 (1.0, 1.2] 28863 0.241694 0.106931 6976.0 21887.0 0.120748 0.103169 0.774911 0.017352 0.051201 0.0144
6 (1.2, 1.4] 15821 0.246950 0.058613 3907.0 11914.0 0.067627 0.056159 0.790366 0.005257 0.015455 0.0144
7 (1.4, 1.6] 9904 0.247274 0.036692 2449.0 7455.0 0.042390 0.035141 0.791317 0.000324 0.000951 0.0144
8 (1.6, 1.8] 6701 0.255932 0.024826 1715.0 4986.0 0.029685 0.023502 0.816720 0.008658 0.025404 0.0144
9 (1.8, 2.0] 4674 0.241335 0.017316 1128.0 3546.0 0.019525 0.016715 0.773857 0.014597 0.042864 0.0144
10 (2.0, 2.2] 3362 0.237359 0.012455 798.0 2564.0 0.013813 0.012086 0.762149 0.003976 0.011708 0.0144
11 (2.2, 2.4] 2546 0.260408 0.009432 663.0 1883.0 0.011476 0.008876 0.829833 0.023050 0.067685 0.0144
12 (2.4, 2.6] 1939 0.242909 0.007184 471.0 1468.0 0.008153 0.006920 0.778486 0.017500 0.051347 0.0144
13 (2.6, 2.8] 1580 0.235443 0.005854 372.0 1208.0 0.006439 0.005694 0.756503 0.007466 0.021984 0.0144
14 (2.8, 3.0] 1329 0.253574 0.004924 337.0 992.0 0.005833 0.004676 0.809808 0.018131 0.053305 0.0144
15 (3.0, 3.2] 1068 0.230337 0.003957 246.0 822.0 0.004258 0.003875 0.741436 0.023237 0.068371 0.0144
16 (3.2, 3.4] 903 0.241417 0.003345 218.0 685.0 0.003773 0.003229 0.774099 0.011080 0.032663 0.0144
17 (3.4, 3.6] 781 0.247119 0.002893 193.0 588.0 0.003341 0.002772 0.790862 0.005702 0.016763 0.0144
18 (3.6, 3.8] 670 0.232836 0.002482 156.0 514.0 0.002700 0.002423 0.748813 0.014283 0.042049 0.0144
19 (3.8, 4.0] 545 0.260550 0.002019 142.0 403.0 0.002458 0.001900 0.830249 0.027715 0.081436 0.0144
20 (4.0, 4.2] 491 0.254582 0.001819 125.0 366.0 0.002164 0.001725 0.812765 0.005968 0.017484 0.0144
21 (4.2, 4.4] 442 0.212670 0.001638 94.0 348.0 0.001627 0.001640 0.689083 0.041913 0.123682 0.0144
22 (4.4, 4.6] 416 0.194712 0.001541 81.0 335.0 0.001402 0.001579 0.635454 0.017958 0.053628 0.0144
23 (4.6, 4.8] 363 0.258953 0.001345 94.0 269.0 0.001627 0.001268 0.825572 0.064242 0.190117 0.0144
24 (4.8, 5.0] 285 0.266667 0.001056 76.0 209.0 0.001315 0.000985 0.848144 0.007713 0.022572 0.0144
25 (5.0, 5.2] 262 0.217557 0.000971 57.0 205.0 0.000987 0.000966 0.703603 0.049109 0.144540 0.0144
26 (5.2, 5.4] 237 0.164557 0.000878 39.0 198.0 0.000675 0.000933 0.544236 0.053000 0.159367 0.0144
27 (5.4, 5.6] 239 0.271967 0.000885 65.0 174.0 0.001125 0.000820 0.863632 0.107410 0.319396 0.0144
28 (5.6, 5.8] 201 0.179104 0.000745 36.0 165.0 0.000623 0.000778 0.588445 0.092862 0.275188 0.0144
29 (5.8, 6.0] 201 0.253731 0.000745 51.0 150.0 0.000883 0.000707 0.810269 0.074627 0.221824 0.0144
30 (6.0, 6.2] 163 0.251534 0.000604 41.0 122.0 0.000710 0.000575 0.803823 0.002198 0.006446 0.0144
31 (6.2, 6.4] 146 0.321918 0.000541 47.0 99.0 0.000814 0.000467 1.009168 0.070384 0.205345 0.0144
32 (6.4, 6.599] 143 0.265734 0.000530 38.0 105.0 0.000658 0.000495 0.845417 0.056184 0.163751 0.0144
33 (6.599, 6.799] 146 0.273973 0.000541 40.0 106.0 0.000692 0.000500 0.869491 0.008238 0.024074 0.0144
34 (6.799, 6.999] 125 0.200000 0.000463 25.0 100.0 0.000433 0.000471 0.651295 0.073973 0.218196 0.0144
35 (6.999, 7.199] 134 0.276119 0.000496 37.0 97.0 0.000640 0.000457 0.875759 0.076119 0.224463 0.0144
36 (7.199, 7.399] 120 0.200000 0.000445 24.0 96.0 0.000415 0.000453 0.651295 0.076119 0.224463 0.0144
37 (7.399, 7.599] 84 0.297619 0.000311 25.0 59.0 0.000433 0.000278 0.938433 0.097619 0.287137 0.0144
38 (7.599, 7.799] 104 0.230769 0.000385 24.0 80.0 0.000415 0.000377 0.742713 0.066850 0.195720 0.0144
39 (7.799, 7.999] 106 0.216981 0.000393 23.0 83.0 0.000398 0.000391 0.701893 0.013788 0.040819 0.0144
40 (7.999, 8.199] 80 0.212500 0.000296 17.0 63.0 0.000294 0.000297 0.688578 0.004481 0.013315 0.0144
41 (8.199, 8.399] 71 0.295775 0.000263 21.0 50.0 0.000363 0.000236 0.933061 0.083275 0.244483 0.0144
42 (8.399, 8.599] 63 0.190476 0.000233 12.0 51.0 0.000208 0.000240 0.622737 0.105298 0.310325 0.0144
43 (8.599, 8.799] 67 0.238806 0.000248 16.0 51.0 0.000277 0.000240 0.766412 0.048330 0.143675 0.0144
44 (8.799, 8.999] 56 0.214286 0.000207 12.0 44.0 0.000208 0.000207 0.693887 0.024520 0.072524 0.0144
45 (8.999, 9.199] 72 0.194444 0.000267 14.0 58.0 0.000242 0.000273 0.634653 0.019841 0.059234 0.0144
46 (9.199, 9.399] 48 0.333333 0.000178 16.0 32.0 0.000277 0.000151 1.042412 0.138889 0.407758 0.0144
47 (9.399, 9.599] 55 0.254545 0.000204 14.0 41.0 0.000242 0.000193 0.812656 0.078788 0.229756 0.0144
48 (9.599, 9.799] 43 0.209302 0.000159 9.0 34.0 0.000156 0.000160 0.679061 0.045243 0.133595 0.0144
49 (9.799, 9.999] 42 0.190476 0.000156 8.0 34.0 0.000138 0.000160 0.622737 0.018826 0.056324 0.0144
In [1319]:
plot_by_woe(df_temp, 90)
# We plot the weight of evidence values.
[Plot: weight of evidence by 'revol_bal_to_bc_limit' category]
In [1320]:
# Categories: '< 0.6', '0.6 - 1.2', '1.2 - 3.4', '3.4 - 5.5', '5.5 - 10.', '> 10.'
# (the dummy names below keep the rounded 1.2-3.6 / 3.6-5.5 labels)
df_inputs_prepr['revol_bal_to_bc_limit:0-0.6'] = np.where((df_inputs_prepr['revol_bal_to_bc_limit'] <= 0.6), 1, 0)
df_inputs_prepr['revol_bal_to_bc_limit:0.6-1.2'] = np.where((df_inputs_prepr['revol_bal_to_bc_limit'] > 0.6) & (df_inputs_prepr['revol_bal_to_bc_limit'] <= 1.2), 1, 0)
df_inputs_prepr['revol_bal_to_bc_limit:1.2-3.6'] = np.where((df_inputs_prepr['revol_bal_to_bc_limit'] > 1.2) & (df_inputs_prepr['revol_bal_to_bc_limit'] <= 3.4), 1, 0)
df_inputs_prepr['revol_bal_to_bc_limit:3.6-5.5'] = np.where((df_inputs_prepr['revol_bal_to_bc_limit'] > 3.4) & (df_inputs_prepr['revol_bal_to_bc_limit'] <= 5.5), 1, 0)
df_inputs_prepr['revol_bal_to_bc_limit:5.5-10.'] = np.where((df_inputs_prepr['revol_bal_to_bc_limit'] > 5.5) & (df_inputs_prepr['revol_bal_to_bc_limit'] <= 10.), 1, 0)
df_inputs_prepr['revol_bal_to_bc_limit:>10.'] = np.where((df_inputs_prepr['revol_bal_to_bc_limit'] > 10.), 1, 0)
In [1321]:
df_inputs_prepr = df_inputs_prepr.drop(columns = ['revol_bal_to_bc_limit_factor'])
# Drop the temporary fine-classing factor column.

Variable: 'revol_bal_to_open_to_buy'¶

In [1322]:
# number of unique values
df_inputs_prepr['revol_bal_to_open_to_buy'].nunique() 
Out[1322]:
260618
In [1323]:
# 'revol_bal_to_open_to_buy'
df_inputs_prepr['revol_bal_to_open_to_buy_factor'] = pd.cut(df_inputs_prepr['revol_bal_to_open_to_buy'], 50)
# Here we do fine-classing: using the 'cut' method, we split the variable into 50 categories by its values.

# 'revol_bal_to_open_to_buy'
df_temp = woe_ordered_continuous(df_inputs_prepr, 'revol_bal_to_open_to_buy_factor', df_targets_prepr)
# We calculate weight of evidence.
df_temp
Out[1323]:
revol_bal_to_open_to_buy_factor n_obs prop_good prop_n_obs n_good n_bad prop_n_good prop_n_bad WoE diff_prop_good diff_WoE IV
0 (-36.021, 720.42] 269193 0.213092 0.997410 57363.0 211830.0 0.996681 0.997608 0.692683 NaN NaN inf
1 (720.42, 1440.84] 377 0.262599 0.001397 99.0 278.0 0.001720 0.001309 0.838909 0.049507 0.146226 inf
2 (1440.84, 2161.26] 120 0.308333 0.000445 37.0 83.0 0.000643 0.000391 0.972542 0.045734 0.133633 inf
3 (2161.26, 2881.68] 53 0.320755 0.000196 17.0 36.0 0.000295 0.000170 1.008761 0.012421 0.036219 inf
4 (2881.68, 3602.1] 35 0.200000 0.000130 7.0 28.0 0.000122 0.000132 0.653544 0.120755 0.355217 inf
5 (3602.1, 4322.52] 25 0.280000 0.000093 7.0 18.0 0.000122 0.000085 0.889846 0.080000 0.236302 inf
6 (4322.52, 5042.94] 20 0.250000 0.000074 5.0 15.0 0.000087 0.000071 0.801907 0.030000 0.087939 inf
7 (5042.94, 5763.36] 13 0.384615 0.000048 5.0 8.0 0.000087 0.000038 1.195696 0.134615 0.393788 inf
8 (5763.36, 6483.78] 7 0.142857 0.000026 1.0 6.0 0.000017 0.000028 0.479270 0.241758 0.716426 inf
9 (6483.78, 7204.2] 6 0.166667 0.000022 1.0 5.0 0.000017 0.000024 0.552663 0.023810 0.073393 inf
10 (7204.2, 7924.62] 7 0.285714 0.000026 2.0 5.0 0.000035 0.000024 0.906543 0.119048 0.353880 inf
11 (7924.62, 8645.04] 5 0.400000 0.000019 2.0 3.0 0.000035 0.000014 1.241147 0.114286 0.334605 inf
12 (8645.04, 9365.46] 5 0.000000 0.000019 0.0 5.0 0.000000 0.000024 0.000000 0.400000 1.241147 inf
13 (9365.46, 10085.88] 4 0.500000 0.000015 2.0 2.0 0.000035 0.000009 1.545298 0.500000 1.545298 inf
14 (10085.88, 10806.3] 1 0.000000 0.000004 0.0 1.0 0.000000 0.000005 0.000000 0.500000 1.545298 inf
15 (10806.3, 11526.72] 3 0.333333 0.000011 1.0 2.0 0.000017 0.000009 1.045452 0.333333 1.045452 inf
16 (11526.72, 12247.14] 2 0.500000 0.000007 1.0 1.0 0.000017 0.000005 1.545298 0.166667 0.499846 inf
17 (12247.14, 12967.56] 5 0.200000 0.000019 1.0 4.0 0.000017 0.000019 0.653544 0.300000 0.891754 inf
18 (12967.56, 13687.98] 2 0.000000 0.000007 0.0 2.0 0.000000 0.000009 0.000000 0.200000 0.653544 inf
19 (13687.98, 14408.4] 0 NaN 0.000000 NaN NaN NaN NaN NaN NaN NaN inf
20 (14408.4, 15128.82] 1 0.000000 0.000004 0.0 1.0 0.000000 0.000005 0.000000 NaN NaN inf
21 (15128.82, 15849.24] 1 0.000000 0.000004 0.0 1.0 0.000000 0.000005 0.000000 0.000000 0.000000 inf
22 (15849.24, 16569.66] 1 0.000000 0.000004 0.0 1.0 0.000000 0.000005 0.000000 0.000000 0.000000 inf
23 (16569.66, 17290.08] 1 0.000000 0.000004 0.0 1.0 0.000000 0.000005 0.000000 0.000000 0.000000 inf
24 (17290.08, 18010.5] 0 NaN 0.000000 NaN NaN NaN NaN NaN NaN NaN inf
25 (18010.5, 18730.92] 0 NaN 0.000000 NaN NaN NaN NaN NaN NaN NaN inf
26 (18730.92, 19451.34] 0 NaN 0.000000 NaN NaN NaN NaN NaN NaN NaN inf
27 (19451.34, 20171.76] 0 NaN 0.000000 NaN NaN NaN NaN NaN NaN NaN inf
28 (20171.76, 20892.18] 0 NaN 0.000000 NaN NaN NaN NaN NaN NaN NaN inf
29 (20892.18, 21612.6] 0 NaN 0.000000 NaN NaN NaN NaN NaN NaN NaN inf
30 (21612.6, 22333.02] 1 0.000000 0.000004 0.0 1.0 0.000000 0.000005 0.000000 NaN NaN inf
31 (22333.02, 23053.44] 1 1.000000 0.000004 1.0 0.0 0.000017 0.000000 inf 1.000000 inf inf
32 (23053.44, 23773.86] 0 NaN 0.000000 NaN NaN NaN NaN NaN NaN NaN inf
33 (23773.86, 24494.28] 0 NaN 0.000000 NaN NaN NaN NaN NaN NaN NaN inf
34 (24494.28, 25214.7] 0 NaN 0.000000 NaN NaN NaN NaN NaN NaN NaN inf
35 (25214.7, 25935.12] 1 1.000000 0.000004 1.0 0.0 0.000017 0.000000 inf NaN NaN inf
36 (25935.12, 26655.54] 0 NaN 0.000000 NaN NaN NaN NaN NaN NaN NaN inf
37 (26655.54, 27375.96] 0 NaN 0.000000 NaN NaN NaN NaN NaN NaN NaN inf
38 (27375.96, 28096.38] 0 NaN 0.000000 NaN NaN NaN NaN NaN NaN NaN inf
39 (28096.38, 28816.8] 0 NaN 0.000000 NaN NaN NaN NaN NaN NaN NaN inf
40 (28816.8, 29537.22] 0 NaN 0.000000 NaN NaN NaN NaN NaN NaN NaN inf
41 (29537.22, 30257.64] 0 NaN 0.000000 NaN NaN NaN NaN NaN NaN NaN inf
42 (30257.64, 30978.06] 0 NaN 0.000000 NaN NaN NaN NaN NaN NaN NaN inf
43 (30978.06, 31698.48] 0 NaN 0.000000 NaN NaN NaN NaN NaN NaN NaN inf
44 (31698.48, 32418.9] 0 NaN 0.000000 NaN NaN NaN NaN NaN NaN NaN inf
45 (32418.9, 33139.32] 0 NaN 0.000000 NaN NaN NaN NaN NaN NaN NaN inf
46 (33139.32, 33859.74] 0 NaN 0.000000 NaN NaN NaN NaN NaN NaN NaN inf
47 (33859.74, 34580.16] 0 NaN 0.000000 NaN NaN NaN NaN NaN NaN NaN inf
48 (34580.16, 35300.58] 0 NaN 0.000000 NaN NaN NaN NaN NaN NaN NaN inf
49 (35300.58, 36021.0] 2 0.500000 0.000007 1.0 1.0 0.000017 0.000005 1.545298 NaN NaN inf
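Several fine classes in the table above contain zero goods or zero bads, which makes their WoE ±inf and drives the cumulative IV column to inf. A common remedy is additive (Laplace) smoothing of the class counts before taking logs. The sketch below uses the conventional WoE/IV definitions and is independent of the notebook's `woe_ordered_continuous` helper, whose exact formula may differ:

```python
import numpy as np

def smoothed_woe_iv(n_good, n_bad, eps=0.5):
    """WoE and IV per class with additive (Laplace) smoothing, so classes
    with zero goods or zero bads do not produce +/-inf.
    Generic sketch -- not the notebook's woe_ordered_continuous helper."""
    n_good = np.asarray(n_good, dtype=float) + eps
    n_bad = np.asarray(n_bad, dtype=float) + eps
    p_good = n_good / n_good.sum()   # distribution of goods over classes
    p_bad = n_bad / n_bad.sum()      # distribution of bads over classes
    woe = np.log(p_good / p_bad)
    iv = ((p_good - p_bad) * woe).sum()
    return woe, iv

# A class with zero goods stays finite after smoothing:
woe, iv = smoothed_woe_iv([153, 129, 0], [476, 596, 2])
```

With `eps = 0.5` every smoothed count is strictly positive, so WoE stays finite and IV remains a usable summary even for sparse tail classes.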
In [1324]:
# One additional category will be created for 'revol_bal_to_open_to_buy' values > 100.
#********************************
# 'revol_bal_to_open_to_buy'
# Keep only the observations with 'revol_bal_to_open_to_buy' less than or equal to 100.
# '.copy()' creates an independent DataFrame and avoids the SettingWithCopyWarning.
df_inputs_prepr_temp = df_inputs_prepr.loc[df_inputs_prepr['revol_bal_to_open_to_buy'] <= 100, : ].copy()

# Here we do fine-classing: using the 'cut' method, we split the variable into 50 equal-width categories by its values.
df_inputs_prepr_temp['revol_bal_to_open_to_buy_factor'] = pd.cut(df_inputs_prepr_temp['revol_bal_to_open_to_buy'], 50)

# We calculate weight of evidence for each fine class.
df_temp = woe_ordered_continuous(df_inputs_prepr_temp, 'revol_bal_to_open_to_buy_factor', df_targets_prepr[df_inputs_prepr_temp.index])
df_temp
C:\Users\pc\AppData\Local\Temp\ipykernel_6728\1025164655.py:4: FutureWarning: The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning.
  df = pd.concat([df.groupby(df.columns.values[0], as_index = False)[df.columns.values[1]].count(),
C:\Users\pc\AppData\Local\Temp\ipykernel_6728\1025164655.py:5: FutureWarning: The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning.
  df.groupby(df.columns.values[0], as_index = False)[df.columns.values[1]].mean()], axis = 1)
Out[1324]:
revol_bal_to_open_to_buy_factor n_obs prop_good prop_n_obs n_good n_bad prop_n_good prop_n_bad WoE diff_prop_good diff_WoE IV
0 (-0.1, 1.999] 125639 0.190275 0.477479 23906.0 101733.0 0.428246 0.490736 0.627361 NaN NaN 0.009295
1 (1.999, 3.999] 47240 0.222671 0.179531 10519.0 36721.0 0.188435 0.177133 0.724550 0.032396 0.097189 0.009295
2 (3.999, 5.998] 22821 0.229745 0.086729 5243.0 17578.0 0.093922 0.084792 0.745584 0.007073 0.021034 0.009295
3 (5.998, 7.998] 13508 0.232381 0.051336 3139.0 10369.0 0.056231 0.050018 0.753409 0.002636 0.007825 0.009295
4 (7.998, 9.997] 8727 0.231924 0.033166 2024.0 6703.0 0.036257 0.032334 0.752054 0.000457 0.001356 0.009295
5 (9.997, 11.996] 6368 0.235867 0.024201 1502.0 4866.0 0.026906 0.023472 0.763746 0.003943 0.011692 0.009295
6 (11.996, 13.996] 4904 0.232667 0.018637 1141.0 3763.0 0.020440 0.018152 0.754259 0.003200 0.009487 0.009295
7 (13.996, 15.995] 3858 0.245464 0.014662 947.0 2911.0 0.016964 0.014042 0.792140 0.012797 0.037880 0.009295
8 (15.995, 17.995] 3160 0.240190 0.012009 759.0 2401.0 0.013597 0.011582 0.776547 0.005274 0.015593 0.009295
9 (17.995, 19.994] 2683 0.242639 0.010196 651.0 2032.0 0.011662 0.009802 0.783790 0.002449 0.007244 0.009295
10 (19.994, 21.993] 2228 0.230700 0.008467 514.0 1714.0 0.009208 0.008268 0.748422 0.011939 0.035369 0.009295
11 (21.993, 23.993] 1993 0.233818 0.007574 466.0 1527.0 0.008348 0.007366 0.757673 0.003118 0.009252 0.009295
12 (23.993, 25.992] 1719 0.248400 0.006533 427.0 1292.0 0.007649 0.006232 0.800810 0.014582 0.043136 0.009295
13 (25.992, 27.992] 1513 0.235294 0.005750 356.0 1157.0 0.006377 0.005581 0.762049 0.013106 0.038761 0.009295
14 (27.992, 29.991] 1402 0.233951 0.005328 328.0 1074.0 0.005876 0.005181 0.758068 0.001343 0.003980 0.009295
15 (29.991, 31.99] 1218 0.270936 0.004629 330.0 888.0 0.005912 0.004284 0.867131 0.036984 0.109063 0.009295
16 (31.99, 33.99] 1142 0.243433 0.004340 278.0 864.0 0.004980 0.004168 0.786137 0.027503 0.080994 0.009295
17 (33.99, 35.989] 972 0.232510 0.003694 226.0 746.0 0.004049 0.003599 0.753794 0.010922 0.032343 0.009295
18 (35.989, 37.989] 906 0.250552 0.003443 227.0 679.0 0.004066 0.003275 0.807158 0.018042 0.053365 0.009295
19 (37.989, 39.988] 848 0.247642 0.003223 210.0 638.0 0.003762 0.003078 0.798570 0.002910 0.008588 0.009295
20 (39.988, 41.987] 737 0.226594 0.002801 167.0 570.0 0.002992 0.002750 0.736223 0.021047 0.062347 0.009295
21 (41.987, 43.987] 743 0.298789 0.002824 222.0 521.0 0.003977 0.002513 0.948719 0.072194 0.212496 0.009295
22 (43.987, 45.986] 623 0.243981 0.002368 152.0 471.0 0.002723 0.002272 0.787757 0.054808 0.160962 0.009295
23 (45.986, 47.986] 613 0.275693 0.002330 169.0 444.0 0.003027 0.002142 0.881090 0.031713 0.093333 0.009295
24 (47.986, 49.985] 540 0.233333 0.002052 126.0 414.0 0.002257 0.001997 0.756235 0.042360 0.124855 0.009295
25 (49.985, 51.984] 520 0.232692 0.001976 121.0 399.0 0.002168 0.001925 0.754334 0.000641 0.001901 0.009295
26 (51.984, 53.984] 503 0.246521 0.001912 124.0 379.0 0.002221 0.001828 0.795261 0.013829 0.040928 0.009295
27 (53.984, 55.983] 467 0.256959 0.001775 120.0 347.0 0.002150 0.001674 0.826042 0.010438 0.030780 0.009295
28 (55.983, 57.983] 439 0.277904 0.001668 122.0 317.0 0.002185 0.001529 0.887573 0.020945 0.061532 0.009295
29 (57.983, 59.982] 398 0.261307 0.001513 104.0 294.0 0.001863 0.001418 0.838836 0.016598 0.048738 0.009295
30 (59.982, 61.981] 343 0.204082 0.001304 70.0 273.0 0.001254 0.001317 0.668966 0.057225 0.169870 0.009295
31 (61.981, 63.981] 344 0.220930 0.001307 76.0 268.0 0.001361 0.001293 0.719363 0.016849 0.050397 0.009295
32 (63.981, 65.98] 325 0.261538 0.001235 85.0 240.0 0.001523 0.001158 0.839518 0.040608 0.120155 0.009295
33 (65.98, 67.98] 308 0.266234 0.001171 82.0 226.0 0.001469 0.001090 0.853321 0.004695 0.013803 0.009295
34 (67.98, 69.979] 303 0.277228 0.001152 84.0 219.0 0.001505 0.001056 0.885589 0.010994 0.032268 0.009295
35 (69.979, 71.978] 273 0.223443 0.001038 61.0 212.0 0.001093 0.001023 0.726848 0.053784 0.158742 0.009295
36 (71.978, 73.978] 255 0.254902 0.000969 65.0 190.0 0.001164 0.000917 0.819982 0.031459 0.093134 0.009295
37 (73.978, 75.977] 251 0.243028 0.000954 61.0 190.0 0.001093 0.000917 0.784941 0.011874 0.035041 0.009295
38 (75.977, 77.977] 220 0.236364 0.000836 52.0 168.0 0.000932 0.000810 0.765218 0.006664 0.019723 0.009295
39 (77.977, 79.976] 240 0.295833 0.000912 71.0 169.0 0.001272 0.000815 0.940074 0.059470 0.174857 0.009295
40 (79.976, 81.975] 219 0.228311 0.000832 50.0 169.0 0.000896 0.000815 0.741324 0.067523 0.198750 0.009295
41 (81.975, 83.975] 206 0.281553 0.000783 58.0 148.0 0.001039 0.000714 0.898269 0.053243 0.156945 0.009295
42 (83.975, 85.974] 212 0.221698 0.000806 47.0 165.0 0.000842 0.000796 0.721651 0.059855 0.176618 0.009295
43 (85.974, 87.974] 217 0.317972 0.000825 69.0 148.0 0.001236 0.000714 1.004801 0.096274 0.283150 0.009295
44 (87.974, 89.973] 177 0.265537 0.000673 47.0 130.0 0.000842 0.000627 0.851273 0.052436 0.153528 0.009295
45 (89.973, 91.972] 185 0.302703 0.000703 56.0 129.0 0.001003 0.000622 0.960165 0.037166 0.108892 0.009295
46 (91.972, 93.972] 154 0.266234 0.000585 41.0 113.0 0.000734 0.000545 0.853321 0.036469 0.106844 0.009295
47 (93.972, 95.971] 156 0.275641 0.000593 43.0 113.0 0.000770 0.000545 0.880936 0.009407 0.027615 0.009295
48 (95.971, 97.971] 147 0.265306 0.000559 39.0 108.0 0.000699 0.000521 0.850595 0.010335 0.030341 0.009295
49 (97.971, 99.97] 163 0.282209 0.000619 46.0 117.0 0.000824 0.000564 0.900189 0.016902 0.049593 0.009295
In [1325]:
plot_by_woe(df_temp, 90)
# We plot the weight of evidence values.
[Figure: weight of evidence (WoE) by 'revol_bal_to_open_to_buy_factor' category]
In [1326]:
# Categories: '< 2', '2 - 4', '4 - 20', '20 - 100', '> 100.'
df_inputs_prepr['revol_bal_to_open_to_buy:0-2'] = np.where((df_inputs_prepr['revol_bal_to_open_to_buy'] <= 2.), 1, 0)
df_inputs_prepr['revol_bal_to_open_to_buy:2-4'] = np.where((df_inputs_prepr['revol_bal_to_open_to_buy'] > 2.) & (df_inputs_prepr['revol_bal_to_open_to_buy'] <= 4.), 1, 0)
df_inputs_prepr['revol_bal_to_open_to_buy:4-20'] = np.where((df_inputs_prepr['revol_bal_to_open_to_buy'] > 4.) & (df_inputs_prepr['revol_bal_to_open_to_buy'] <= 20.), 1, 0)
df_inputs_prepr['revol_bal_to_open_to_buy:20-100'] = np.where((df_inputs_prepr['revol_bal_to_open_to_buy'] > 20.) & (df_inputs_prepr['revol_bal_to_open_to_buy'] <= 100.), 1, 0)
df_inputs_prepr['revol_bal_to_open_to_buy:>100'] = np.where((df_inputs_prepr['revol_bal_to_open_to_buy'] > 100.), 1, 0)
In [1327]:
df_inputs_prepr = df_inputs_prepr.drop(columns = ['revol_bal_to_open_to_buy_factor'])
# Drop the temporary fine-classing factor column.
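The block of `np.where` calls above repeats the same pattern for each cut point. Assuming the variable is non-negative, the dummies can be generated from a list of upper bounds; `make_interval_dummies` is a hypothetical helper for illustration, not part of the notebook's utilities:

```python
import numpy as np
import pandas as pd

def make_interval_dummies(df, col, cuts):
    """Create 'col:a-b' dummy columns for (a, b] intervals plus an open-ended
    'col:>last' bucket, mirroring the np.where pattern above.
    Hypothetical helper -- not part of the notebook's utilities."""
    edges = [-np.inf] + list(cuts)
    for lo, hi in zip(edges[:-1], edges[1:]):
        name = f"{col}:{'0' if lo == -np.inf else lo}-{hi}"
        df[name] = np.where((df[col] > lo) & (df[col] <= hi), 1, 0)
    df[f"{col}:>{cuts[-1]}"] = np.where(df[col] > cuts[-1], 1, 0)
    return df

# Toy data only; in the notebook this would run on df_inputs_prepr
# with col='revol_bal_to_open_to_buy' and cuts=[2, 4, 20, 100].
demo = pd.DataFrame({'x': [1.0, 3.0, 10.0, 150.0]})
demo = make_interval_dummies(demo, 'x', [2, 4, 20, 100])
```

Generating the dummies from one list of cut points keeps the interval boundaries consistent and makes later re-binning a one-line change.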

Variable: 'total_bal_ex_mort_to_inc'¶

In [1328]:
# maximum value
df_inputs_prepr['total_bal_ex_mort_to_inc'].max()
Out[1328]:
102819.0
In [1329]:
# Observations with 'total_bal_ex_mort_to_inc' > 10 are excluded from fine-classing;
# the open-ended coarse category defined later captures them.
#********************************
# 'total_bal_ex_mort_to_inc'
# Keep only the observations with 'total_bal_ex_mort_to_inc' less than or equal to 10.
df_inputs_prepr_temp = df_inputs_prepr.loc[df_inputs_prepr['total_bal_ex_mort_to_inc'] <= 10, : ].copy()

# Here we do fine-classing: using the 'cut' method, we split the variable into 50 equal-width categories by its values.
df_inputs_prepr_temp['total_bal_ex_mort_to_inc_factor'] = pd.cut(df_inputs_prepr_temp['total_bal_ex_mort_to_inc'], 50)

# We calculate weight of evidence for each fine class.
df_temp = woe_ordered_continuous(df_inputs_prepr_temp, 'total_bal_ex_mort_to_inc_factor', df_targets_prepr[df_inputs_prepr_temp.index])
df_temp
Out[1329]:
total_bal_ex_mort_to_inc_factor n_obs prop_good prop_n_obs n_good n_bad prop_n_good prop_n_bad WoE diff_prop_good diff_WoE IV
0 (-0.00994, 0.199] 27084 0.178260 0.098807 4828.0 22256.0 0.082145 0.103355 0.584885 NaN NaN 0.009508
1 (0.199, 0.398] 56628 0.188052 0.206589 10649.0 45979.0 0.181186 0.213523 0.614403 0.009792 2.951776e-02 0.009508
2 (0.398, 0.597] 60626 0.209201 0.221175 12683.0 47943.0 0.215793 0.222644 0.677642 0.021149 6.323896e-02 0.009508
3 (0.597, 0.795] 47141 0.232791 0.171979 10974.0 36167.0 0.186715 0.167957 0.747486 0.023590 6.984450e-02 0.009508
4 (0.795, 0.994] 30836 0.239558 0.112495 7387.0 23449.0 0.125685 0.108895 0.767410 0.006767 1.992332e-02 0.009508
5 (0.994, 1.193] 17928 0.239904 0.065405 4301.0 13627.0 0.073179 0.063283 0.768428 0.000346 1.018757e-03 0.009508
6 (1.193, 1.392] 10527 0.232165 0.038404 2444.0 8083.0 0.041583 0.037537 0.745641 0.007739 2.278773e-02 0.009508
7 (1.392, 1.591] 6491 0.233862 0.023680 1518.0 4973.0 0.025828 0.023094 0.750643 0.001697 5.002816e-03 0.009508
8 (1.591, 1.79] 4210 0.224703 0.015359 946.0 3264.0 0.016096 0.015158 0.723612 0.009159 2.703133e-02 0.009508
9 (1.79, 1.988] 2990 0.233779 0.010908 699.0 2291.0 0.011893 0.010639 0.750399 0.009076 2.678674e-02 0.009508
10 (1.988, 2.187] 2190 0.233333 0.007990 511.0 1679.0 0.008694 0.007797 0.749085 0.000446 1.314098e-03 0.009508
11 (2.187, 2.386] 1577 0.239062 0.005753 377.0 1200.0 0.006414 0.005573 0.765950 0.005728 1.686548e-02 0.009508
12 (2.386, 2.585] 1217 0.221857 0.004440 270.0 947.0 0.004594 0.004398 0.715194 0.017204 5.075619e-02 0.009508
13 (2.585, 2.784] 973 0.256937 0.003550 250.0 723.0 0.004254 0.003358 0.818399 0.035080 1.032047e-01 0.009508
14 (2.784, 2.983] 686 0.240525 0.002503 165.0 521.0 0.002807 0.002419 0.770254 0.016413 4.814512e-02 0.009508
15 (2.983, 3.181] 572 0.260490 0.002087 149.0 423.0 0.002535 0.001964 0.828793 0.019965 5.853888e-02 0.009508
16 (3.181, 3.38] 464 0.284483 0.001693 132.0 332.0 0.002246 0.001542 0.898812 0.023993 7.001976e-02 0.009508
17 (3.38, 3.579] 317 0.261830 0.001156 83.0 234.0 0.001412 0.001087 0.832712 0.022653 6.610065e-02 0.009508
18 (3.579, 3.778] 262 0.263359 0.000956 69.0 193.0 0.001174 0.000896 0.837182 0.001529 4.470404e-03 0.009508
19 (3.778, 3.977] 221 0.294118 0.000806 65.0 156.0 0.001106 0.000724 0.926865 0.030759 8.968256e-02 0.009508
20 (3.977, 4.176] 177 0.248588 0.000646 44.0 133.0 0.000749 0.000618 0.793932 0.045530 1.329325e-01 0.009508
21 (4.176, 4.375] 135 0.274074 0.000493 37.0 98.0 0.000630 0.000455 0.868471 0.025487 7.453876e-02 0.009508
22 (4.375, 4.573] 127 0.228346 0.000463 29.0 98.0 0.000493 0.000455 0.734375 0.045728 1.340955e-01 0.009508
23 (4.573, 4.772] 104 0.269231 0.000379 28.0 76.0 0.000476 0.000353 0.854336 0.040884 1.199606e-01 0.009508
24 (4.772, 4.971] 95 0.084211 0.000347 8.0 87.0 0.000136 0.000404 0.290353 0.185020 5.639830e-01 0.009508
25 (4.971, 5.17] 71 0.211268 0.000259 15.0 56.0 0.000255 0.000260 0.683788 0.127057 3.934354e-01 0.009508
26 (5.17, 5.369] 48 0.250000 0.000175 12.0 36.0 0.000204 0.000167 0.798075 0.038732 1.142863e-01 0.009508
27 (5.369, 5.568] 46 0.239130 0.000168 11.0 35.0 0.000187 0.000163 0.766153 0.010870 3.192155e-02 0.009508
28 (5.568, 5.766] 50 0.260000 0.000182 13.0 37.0 0.000221 0.000172 0.827361 0.020870 6.120768e-02 0.009508
29 (5.766, 5.965] 44 0.250000 0.000161 11.0 33.0 0.000187 0.000153 0.798075 0.010000 2.928614e-02 0.009508
30 (5.965, 6.164] 33 0.303030 0.000120 10.0 23.0 0.000170 0.000107 0.952795 0.053030 1.547208e-01 0.009508
31 (6.164, 6.363] 31 0.322581 0.000113 10.0 21.0 0.000170 0.000098 1.009656 0.019550 5.686078e-02 0.009508
32 (6.363, 6.562] 33 0.303030 0.000120 10.0 23.0 0.000170 0.000107 0.952795 0.019550 5.686078e-02 0.009508
33 (6.562, 6.761] 21 0.190476 0.000077 4.0 17.0 0.000068 0.000079 0.621687 0.112554 3.311088e-01 0.009508
34 (6.761, 6.959] 22 0.409091 0.000080 9.0 13.0 0.000153 0.000060 1.263127 0.218615 6.414405e-01 0.009508
35 (6.959, 7.158] 14 0.214286 0.000051 3.0 11.0 0.000051 0.000051 0.692753 0.194805 5.703736e-01 0.009508
36 (7.158, 7.357] 14 0.214286 0.000051 3.0 11.0 0.000051 0.000051 0.692753 0.000000 0.000000e+00 0.009508
37 (7.357, 7.556] 17 0.058824 0.000062 1.0 16.0 0.000017 0.000074 0.206190 0.155462 4.865638e-01 0.009508
38 (7.556, 7.755] 10 0.100000 0.000036 1.0 9.0 0.000017 0.000042 0.341521 0.041176 1.353317e-01 0.009508
39 (7.755, 7.954] 16 0.187500 0.000058 3.0 13.0 0.000051 0.000060 0.612744 0.087500 2.712222e-01 0.009508
40 (7.954, 8.153] 3 0.333333 0.000011 1.0 2.0 0.000017 0.000009 1.040944 0.145833 4.282008e-01 0.009508
41 (8.153, 8.351] 9 0.333333 0.000033 3.0 6.0 0.000051 0.000028 1.040944 0.000000 2.220446e-16 0.009508
42 (8.351, 8.55] 10 0.100000 0.000036 1.0 9.0 0.000017 0.000042 0.341521 0.233333 6.994230e-01 0.009508
43 (8.55, 8.749] 6 0.333333 0.000022 2.0 4.0 0.000034 0.000019 1.040944 0.233333 6.994230e-01 0.009508
44 (8.749, 8.948] 4 0.250000 0.000015 1.0 3.0 0.000017 0.000014 0.798075 0.083333 2.428697e-01 0.009508
45 (8.948, 9.147] 8 0.000000 0.000029 0.0 8.0 0.000000 0.000037 0.000000 0.250000 7.980746e-01 0.009508
46 (9.147, 9.346] 8 0.375000 0.000029 3.0 5.0 0.000051 0.000023 1.162609 0.375000 1.162609e+00 0.009508
47 (9.346, 9.544] 6 0.166667 0.000022 1.0 5.0 0.000017 0.000023 0.549713 0.208333 6.128962e-01 0.009508
48 (9.544, 9.743] 4 0.000000 0.000015 0.0 4.0 0.000000 0.000019 0.000000 0.166667 5.497132e-01 0.009508
49 (9.743, 9.942] 3 0.000000 0.000011 0.0 3.0 0.000000 0.000014 0.000000 0.000000 0.000000e+00 0.009508
In [1330]:
plot_by_woe(df_temp, 90)
# We plot the weight of evidence values.
No description has been provided for this image
In [1331]:
# Categories: '< 0.4', '0.4 - 1', '1 - 2.6', '2.6 - 4.4', '> 4.4'
df_inputs_prepr['total_bal_ex_mort_to_inc:0-0.4'] = np.where((df_inputs_prepr['total_bal_ex_mort_to_inc'] <= 0.4), 1, 0)
df_inputs_prepr['total_bal_ex_mort_to_inc:0.4-1'] = np.where((df_inputs_prepr['total_bal_ex_mort_to_inc'] > 0.4) & (df_inputs_prepr['total_bal_ex_mort_to_inc'] <= 1.), 1, 0)
df_inputs_prepr['total_bal_ex_mort_to_inc:1-2.6'] = np.where((df_inputs_prepr['total_bal_ex_mort_to_inc'] > 1.) & (df_inputs_prepr['total_bal_ex_mort_to_inc'] <= 2.6), 1, 0)
df_inputs_prepr['total_bal_ex_mort_to_inc:2.6-4.4'] = np.where((df_inputs_prepr['total_bal_ex_mort_to_inc'] > 2.6) & (df_inputs_prepr['total_bal_ex_mort_to_inc'] <= 4.4), 1, 0)
df_inputs_prepr['total_bal_ex_mort_to_inc:>4.4'] = np.where((df_inputs_prepr['total_bal_ex_mort_to_inc'] > 4.4), 1, 0)
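Given the maximum of 102,819 reported above against a fine-classing cap of 10, it is worth checking how much of the sample the truncation leaves out before settling on the bins. `share_above` is a hypothetical helper; in the notebook it would be applied to `df_inputs_prepr['total_bal_ex_mort_to_inc']` with a cap of 10:

```python
import pandas as pd

def share_above(series, cap):
    """Fraction of non-missing observations above `cap` -- a quick check of
    how much data the fine-classing truncation excludes."""
    s = series.dropna()
    return float((s > cap).mean())

# Illustrative data only; not values from the Lending Club dataset.
s = pd.Series([0.3, 0.8, 1.5, 4.0, 9.5, 102819.0, None])
frac = share_above(s, 10)
```

If the excluded share turns out to be non-negligible, the open-ended coarse category should be inspected separately rather than lumped in by default.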

Variable: 'total_balance_to_credit_ratio'¶

In [1332]:
# unique values
df_inputs_prepr['total_balance_to_credit_ratio'].nunique() 
Out[1332]:
259985
In [1333]:
# 'total_balance_to_credit_ratio'
# Keep only the observations with 'total_balance_to_credit_ratio' less than or equal to 2.
df_inputs_prepr_temp = df_inputs_prepr.loc[df_inputs_prepr['total_balance_to_credit_ratio'] <= 2., : ].copy()

# Here we do fine-classing: using the 'cut' method, we split the variable into 40 equal-width categories by its values.
df_inputs_prepr_temp['total_balance_to_credit_ratio_factor'] = pd.cut(df_inputs_prepr_temp['total_balance_to_credit_ratio'], 40)

# We calculate weight of evidence for each fine class.
df_temp = woe_ordered_continuous(df_inputs_prepr_temp, 'total_balance_to_credit_ratio_factor', df_targets_prepr[df_inputs_prepr_temp.index])
df_temp
Out[1333]:
total_balance_to_credit_ratio_factor n_obs prop_good prop_n_obs n_good n_bad prop_n_good prop_n_bad WoE diff_prop_good diff_WoE IV
0 (-0.00197, 0.0491] 629 0.243243 0.002294 153.0 476.0 0.002602 0.002210 0.778251 NaN NaN inf
1 (0.0491, 0.0983] 725 0.177931 0.002644 129.0 596.0 0.002194 0.002767 0.583896 0.065312 0.194355 inf
2 (0.0983, 0.147] 1004 0.184263 0.003661 185.0 819.0 0.003147 0.003802 0.603007 0.006332 0.019111 inf
3 (0.147, 0.197] 1623 0.175601 0.005919 285.0 1338.0 0.004847 0.006211 0.576845 0.008662 0.026161 inf
4 (0.197, 0.246] 2306 0.185169 0.008410 427.0 1879.0 0.007263 0.008723 0.605736 0.009568 0.028891 inf
5 (0.246, 0.295] 3343 0.178881 0.012191 598.0 2745.0 0.010171 0.012743 0.586768 0.006288 0.018968 inf
6 (0.295, 0.344] 4461 0.194351 0.016269 867.0 3594.0 0.014746 0.016684 0.633315 0.015470 0.046547 inf
7 (0.344, 0.393] 5882 0.202992 0.021451 1194.0 4688.0 0.020308 0.021763 0.659152 0.008641 0.025836 inf
8 (0.393, 0.442] 7282 0.204065 0.026556 1486.0 5796.0 0.025274 0.026906 0.662351 0.001073 0.003200 inf
9 (0.442, 0.491] 9184 0.222561 0.033493 2044.0 7140.0 0.034765 0.033145 0.717284 0.018496 0.054933 inf
10 (0.491, 0.54] 11099 0.227408 0.040476 2524.0 8575.0 0.042929 0.039807 0.731611 0.004847 0.014327 inf
11 (0.54, 0.59] 13177 0.232754 0.048054 3067.0 10110.0 0.052164 0.046933 0.747385 0.005346 0.015774 inf
12 (0.59, 0.639] 15165 0.231916 0.055304 3517.0 11648.0 0.059818 0.054072 0.744913 0.000838 0.002472 inf
13 (0.639, 0.688] 17992 0.240218 0.065614 4322.0 13670.0 0.073510 0.063459 0.769359 0.008302 0.024446 inf
14 (0.688, 0.737] 33716 0.200172 0.122957 6749.0 26967.0 0.114789 0.125186 0.650732 0.040046 0.118627 inf
15 (0.737, 0.786] 24314 0.225179 0.088669 5475.0 18839.0 0.093120 0.087454 0.725026 0.025007 0.074294 inf
16 (0.786, 0.835] 29154 0.208376 0.106320 6075.0 23079.0 0.103325 0.107137 0.675195 0.016803 0.049830 inf
17 (0.835, 0.884] 34777 0.205682 0.126826 7153.0 27624.0 0.121660 0.128236 0.667172 0.002694 0.008024 inf
18 (0.884, 0.934] 34197 0.208235 0.124711 7121.0 27076.0 0.121116 0.125692 0.674774 0.002553 0.007602 inf
19 (0.934, 0.983] 18221 0.214697 0.066449 3912.0 14309.0 0.066536 0.066425 0.693982 0.006463 0.019208 inf
20 (0.983, 1.032] 3094 0.250808 0.011283 776.0 2318.0 0.013198 0.010761 0.800452 0.036111 0.106469 inf
21 (1.032, 1.081] 1215 0.262551 0.004431 319.0 896.0 0.005426 0.004159 0.834830 0.011743 0.034379 inf
22 (1.081, 1.13] 671 0.256334 0.002447 172.0 499.0 0.002925 0.002316 0.816640 0.006218 0.018190 inf
23 (1.13, 1.179] 391 0.232737 0.001426 91.0 300.0 0.001548 0.001393 0.747333 0.023597 0.069307 inf
24 (1.179, 1.228] 200 0.255000 0.000729 51.0 149.0 0.000867 0.000692 0.812734 0.022263 0.065401 inf
25 (1.228, 1.277] 158 0.253165 0.000576 40.0 118.0 0.000680 0.000548 0.807358 0.001835 0.005376 inf
26 (1.277, 1.327] 67 0.238806 0.000244 16.0 51.0 0.000272 0.000237 0.765206 0.014359 0.042152 inf
27 (1.327, 1.376] 48 0.291667 0.000175 14.0 34.0 0.000238 0.000158 0.919739 0.052861 0.154533 inf
28 (1.376, 1.425] 27 0.333333 0.000098 9.0 18.0 0.000153 0.000084 1.040954 0.041667 0.121214 inf
29 (1.425, 1.474] 29 0.310345 0.000106 9.0 20.0 0.000153 0.000093 0.974078 0.022989 0.066875 inf
30 (1.474, 1.523] 24 0.208333 0.000088 5.0 19.0 0.000085 0.000088 0.675068 0.102011 0.299010 inf
31 (1.523, 1.572] 12 0.250000 0.000044 3.0 9.0 0.000051 0.000042 0.798082 0.041667 0.123015 inf
32 (1.572, 1.621] 5 0.600000 0.000018 3.0 2.0 0.000051 0.000009 1.871148 0.350000 1.073065 inf
33 (1.621, 1.671] 2 1.000000 0.000007 2.0 0.0 0.000034 0.000000 inf 0.400000 inf inf
34 (1.671, 1.72] 2 0.000000 0.000007 0.0 2.0 0.000000 0.000009 0.000000 1.000000 inf inf
35 (1.72, 1.769] 3 0.000000 0.000011 0.0 3.0 0.000000 0.000014 0.000000 0.000000 0.000000 inf
36 (1.769, 1.818] 3 0.333333 0.000011 1.0 2.0 0.000017 0.000009 1.040954 0.333333 1.040954 inf
37 (1.818, 1.867] 1 0.000000 0.000004 0.0 1.0 0.000000 0.000005 0.000000 0.333333 1.040954 inf
38 (1.867, 1.916] 4 0.000000 0.000015 0.0 4.0 0.000000 0.000019 0.000000 0.000000 0.000000 inf
39 (1.916, 1.965] 3 0.333333 0.000011 1.0 2.0 0.000017 0.000009 1.040954 0.333333 1.040954 inf
In [1334]:
plot_by_woe(df_temp, 90)
# We plot the weight of evidence values.
[Figure: weight of evidence (WoE) by 'total_balance_to_credit_ratio_factor' category]
In [1335]:
# Categories: '< 0.05', '0.05 - 0.2', '0.2 - 0.4', '0.4 - 0.7', '0.7 - 1.', '1. - 1.4', '> 1.4'
df_inputs_prepr['total_balance_to_credit_ratio:0-0.05'] = np.where((df_inputs_prepr['total_balance_to_credit_ratio'] <= 0.05), 1, 0)
df_inputs_prepr['total_balance_to_credit_ratio:0.05-0.2'] = np.where((df_inputs_prepr['total_balance_to_credit_ratio'] > 0.05) & (df_inputs_prepr['total_balance_to_credit_ratio'] <= 0.2), 1, 0)
df_inputs_prepr['total_balance_to_credit_ratio:0.2-0.4'] = np.where((df_inputs_prepr['total_balance_to_credit_ratio'] > 0.2) & (df_inputs_prepr['total_balance_to_credit_ratio'] <= 0.4), 1, 0)
df_inputs_prepr['total_balance_to_credit_ratio:0.4-0.7'] = np.where((df_inputs_prepr['total_balance_to_credit_ratio'] > 0.4) & (df_inputs_prepr['total_balance_to_credit_ratio'] <= 0.7), 1, 0)
df_inputs_prepr['total_balance_to_credit_ratio:0.7-1'] = np.where((df_inputs_prepr['total_balance_to_credit_ratio'] > 0.7) & (df_inputs_prepr['total_balance_to_credit_ratio'] <= 1.), 1, 0)
df_inputs_prepr['total_balance_to_credit_ratio:1-1.4'] = np.where((df_inputs_prepr['total_balance_to_credit_ratio'] > 1.) & (df_inputs_prepr['total_balance_to_credit_ratio'] <= 1.4), 1, 0)
df_inputs_prepr['total_balance_to_credit_ratio:>1.4'] = np.where((df_inputs_prepr['total_balance_to_credit_ratio'] > 1.4), 1, 0)
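Because the coarse-classing dummies above are meant to partition the variable's range, a useful sanity check is that every row activates exactly one dummy. `check_partition` is a hypothetical helper demonstrated on a toy frame, not a verification run on `df_inputs_prepr`:

```python
import numpy as np
import pandas as pd

def check_partition(df, prefix):
    """Verify the coarse-classing dummies sharing a prefix are mutually
    exclusive and exhaustive: each row has exactly one dummy set to 1."""
    cols = [c for c in df.columns if c.startswith(prefix + ':')]
    row_sums = df[cols].sum(axis=1)
    return bool((row_sums == 1).all())

# Toy frame with a simplified three-bucket partition; in the notebook the
# check would run on df_inputs_prepr with prefix 'total_balance_to_credit_ratio'.
demo = pd.DataFrame({'r': [0.01, 0.3, 0.9, 2.5]})
demo['r:0-0.05'] = np.where(demo['r'] <= 0.05, 1, 0)
demo['r:0.05-1'] = np.where((demo['r'] > 0.05) & (demo['r'] <= 1.), 1, 0)
demo['r:>1'] = np.where(demo['r'] > 1., 1, 0)
ok = check_partition(demo, 'r')
```

A row sum other than 1 would point at overlapping or gapped interval boundaries, or at missing values that fall into no bucket.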

Variable: 'rev_to_il_limit_ratio'¶

In [1336]:
# unique values
df_inputs_prepr['rev_to_il_limit_ratio'].nunique() 
Out[1336]:
215869
In [1337]:
# 'rev_to_il_limit_ratio'
# Keep only the observations with 'rev_to_il_limit_ratio' less than or equal to 10.
df_inputs_prepr_temp = df_inputs_prepr.loc[df_inputs_prepr['rev_to_il_limit_ratio'] <= 10., : ].copy()

# Here we do fine-classing: using the 'cut' method, we split the variable into 50 equal-width categories by its values.
df_inputs_prepr_temp['rev_to_il_limit_ratio_factor'] = pd.cut(df_inputs_prepr_temp['rev_to_il_limit_ratio'], 50)

# We calculate weight of evidence for each fine class.
df_temp = woe_ordered_continuous(df_inputs_prepr_temp, 'rev_to_il_limit_ratio_factor', df_targets_prepr[df_inputs_prepr_temp.index])
df_temp
Out[1337]:
rev_to_il_limit_ratio_factor n_obs prop_good prop_n_obs n_good n_bad prop_n_good prop_n_bad WoE diff_prop_good diff_WoE IV
0 (-0.01, 0.2] 27768 0.245534 0.115475 6818.0 20950.0 0.130661 0.111267 0.776706 NaN NaN 0.005387
1 (0.2, 0.4] 42419 0.233575 0.176403 9908.0 32511.0 0.189878 0.172668 0.741779 0.011960 0.034927 0.005387
2 (0.4, 0.6] 34828 0.224302 0.144835 7812.0 27016.0 0.149710 0.143484 0.714610 0.009272 0.027169 0.005387
3 (0.6, 0.8] 39273 0.200392 0.163320 7870.0 31403.0 0.150821 0.166784 0.644111 0.023910 0.070499 0.005387
4 (0.8, 1.0] 18747 0.212834 0.077961 3990.0 14757.0 0.076465 0.078375 0.680882 0.012442 0.036771 0.005387
5 (1.0, 1.2] 13893 0.217520 0.057775 3022.0 10871.0 0.057914 0.057737 0.694680 0.004686 0.013798 0.005387
6 (1.2, 1.4] 10535 0.216421 0.043811 2280.0 8255.0 0.043694 0.043843 0.691449 0.001098 0.003232 0.005387
7 (1.4, 1.6] 8102 0.198470 0.033693 1608.0 6494.0 0.030816 0.034490 0.638410 0.017952 0.053038 0.005387
8 (1.6, 1.8] 6542 0.201009 0.027205 1315.0 5227.0 0.025201 0.027761 0.645938 0.002539 0.007528 0.005387
9 (1.8, 2.0] 5332 0.198612 0.022174 1059.0 4273.0 0.020295 0.022694 0.638834 0.002397 0.007105 0.005387
10 (2.0, 2.2] 4284 0.207049 0.017815 887.0 3397.0 0.016999 0.018042 0.663811 0.008437 0.024977 0.005387
11 (2.2, 2.4] 3594 0.200612 0.014946 721.0 2873.0 0.013817 0.015259 0.644763 0.006437 0.019048 0.005387
12 (2.4, 2.6] 2962 0.194801 0.012318 577.0 2385.0 0.011058 0.012667 0.627519 0.005811 0.017244 0.005387
13 (2.6, 2.8] 2597 0.198691 0.010800 516.0 2081.0 0.009889 0.011052 0.639067 0.003890 0.011548 0.005387
14 (2.8, 3.0] 2204 0.195554 0.009165 431.0 1773.0 0.008260 0.009417 0.629755 0.003137 0.009312 0.005387
15 (3.0, 3.2] 1923 0.197608 0.007997 380.0 1543.0 0.007282 0.008195 0.635854 0.002054 0.006099 0.005387
16 (3.2, 3.4] 1721 0.202789 0.007157 349.0 1372.0 0.006688 0.007287 0.651211 0.005181 0.015356 0.005387
17 (3.4, 3.6] 1404 0.175214 0.005839 246.0 1158.0 0.004714 0.006150 0.569020 0.027575 0.082190 0.005387
18 (3.6, 3.8] 1257 0.183771 0.005227 231.0 1026.0 0.004427 0.005449 0.594652 0.008557 0.025632 0.005387
19 (3.8, 4.0] 1087 0.226311 0.004520 246.0 841.0 0.004714 0.004467 0.720503 0.042540 0.125851 0.005387
20 (4.0, 4.2] 967 0.193382 0.004021 187.0 780.0 0.003584 0.004143 0.623300 0.032929 0.097203 0.005387
21 (4.2, 4.4] 814 0.189189 0.003385 154.0 660.0 0.002951 0.003505 0.610821 0.004192 0.012479 0.005387
22 (4.4, 4.6] 777 0.199485 0.003231 155.0 622.0 0.002970 0.003303 0.641423 0.010296 0.030602 0.005387
23 (4.6, 4.8] 657 0.203957 0.002732 134.0 523.0 0.002568 0.002778 0.654668 0.004472 0.013246 0.005387
24 (4.8, 5.0] 638 0.167712 0.002653 107.0 531.0 0.002051 0.002820 0.546444 0.036246 0.108224 0.005387
25 (5.0, 5.2] 565 0.221239 0.002350 125.0 440.0 0.002396 0.002337 0.705615 0.053527 0.159171 0.005387
26 (5.2, 5.4] 517 0.185687 0.002150 96.0 421.0 0.001840 0.002236 0.600374 0.035552 0.105241 0.005387
27 (5.4, 5.6] 454 0.162996 0.001888 74.0 380.0 0.001418 0.002018 0.532200 0.022691 0.068174 0.005387
28 (5.6, 5.8] 422 0.156398 0.001755 66.0 356.0 0.001265 0.001891 0.512200 0.006597 0.020000 0.005387
29 (5.8, 6.0] 375 0.192000 0.001559 72.0 303.0 0.001380 0.001609 0.619190 0.035602 0.106990 0.005387
30 (6.0, 6.2] 384 0.208333 0.001597 80.0 304.0 0.001533 0.001615 0.667603 0.016333 0.048413 0.005387
31 (6.2, 6.4] 352 0.210227 0.001464 74.0 278.0 0.001418 0.001476 0.673194 0.001894 0.005591 0.005387
32 (6.4, 6.6] 281 0.192171 0.001169 54.0 227.0 0.001035 0.001206 0.619699 0.018056 0.053495 0.005387
33 (6.6, 6.8] 265 0.218868 0.001102 58.0 207.0 0.001112 0.001099 0.698646 0.026697 0.078947 0.005387
34 (6.8, 7.0] 240 0.225000 0.000998 54.0 186.0 0.001035 0.000988 0.716658 0.006132 0.018012 0.005387
35 (7.0, 7.2] 218 0.211009 0.000907 46.0 172.0 0.000882 0.000914 0.675501 0.013991 0.041157 0.005387
36 (7.2, 7.4] 212 0.179245 0.000882 38.0 174.0 0.000728 0.000924 0.581112 0.031764 0.094389 0.005387
37 (7.4, 7.6] 195 0.200000 0.000811 39.0 156.0 0.000747 0.000829 0.642949 0.020755 0.061837 0.005387
38 (7.6, 7.8] 203 0.177340 0.000844 36.0 167.0 0.000690 0.000887 0.575401 0.022660 0.067548 0.005387
39 (7.8, 8.0] 196 0.117347 0.000815 23.0 173.0 0.000441 0.000919 0.391853 0.059993 0.183548 0.005387
40 (8.0, 8.2] 156 0.198718 0.000649 31.0 125.0 0.000594 0.000664 0.639147 0.081371 0.247295 0.005387
41 (8.2, 8.4] 151 0.178808 0.000628 27.0 124.0 0.000517 0.000659 0.579801 0.019910 0.059346 0.005387
42 (8.4, 8.6] 141 0.156028 0.000586 22.0 119.0 0.000422 0.000632 0.511077 0.022780 0.068725 0.005387
43 (8.6, 8.8] 159 0.194969 0.000661 31.0 128.0 0.000594 0.000680 0.628017 0.038940 0.116940 0.005387
44 (8.8, 9.0] 123 0.154472 0.000512 19.0 104.0 0.000364 0.000552 0.506344 0.040497 0.121674 0.005387
45 (9.0, 9.2] 101 0.217822 0.000420 22.0 79.0 0.000422 0.000420 0.695569 0.063350 0.189226 0.005387
46 (9.2, 9.4] 123 0.178862 0.000512 22.0 101.0 0.000422 0.000536 0.579963 0.038960 0.115607 0.005387
47 (9.4, 9.6] 123 0.260163 0.000512 32.0 91.0 0.000613 0.000483 0.819278 0.081301 0.239315 0.005387
48 (9.6, 9.8] 106 0.207547 0.000441 22.0 84.0 0.000422 0.000446 0.665281 0.052615 0.153997 0.005387
49 (9.8, 10.0] 80 0.187500 0.000333 15.0 65.0 0.000287 0.000345 0.605785 0.020047 0.059496 0.005387
In [1338]:
plot_by_woe(df_temp, 90)
# We plot the weight of evidence values.
[Figure: weight of evidence across the 'rev_to_il_limit_ratio' fine classes]
In [1339]:
# Categories: '< 0.6', '0.6 - 0.8', '0.8 - 1.8', '1.8 - 4.5', '4.5 - 10.', '> 10.'
df_inputs_prepr['rev_to_il_limit_ratio:0-0.6'] = np.where((df_inputs_prepr['rev_to_il_limit_ratio'] <= 0.6), 1, 0)
df_inputs_prepr['rev_to_il_limit_ratio:0.6-0.8'] = np.where((df_inputs_prepr['rev_to_il_limit_ratio'] > 0.6) & (df_inputs_prepr['rev_to_il_limit_ratio'] <= 0.8), 1, 0)
df_inputs_prepr['rev_to_il_limit_ratio:0.8-1.8'] = np.where((df_inputs_prepr['rev_to_il_limit_ratio'] > 0.8) & (df_inputs_prepr['rev_to_il_limit_ratio'] <= 1.8), 1, 0)
df_inputs_prepr['rev_to_il_limit_ratio:1.8-4.5'] = np.where((df_inputs_prepr['rev_to_il_limit_ratio'] > 1.8) & (df_inputs_prepr['rev_to_il_limit_ratio'] <= 4.5), 1, 0)
df_inputs_prepr['rev_to_il_limit_ratio:4.5-10'] = np.where((df_inputs_prepr['rev_to_il_limit_ratio'] > 4.5) & (df_inputs_prepr['rev_to_il_limit_ratio'] <= 10.), 1, 0)
df_inputs_prepr['rev_to_il_limit_ratio:>10.'] = np.where((df_inputs_prepr['rev_to_il_limit_ratio'] > 10.), 1, 0)
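The six `np.where` lines above follow one pattern: a 0/1 dummy per half-open interval. As a sketch only (the helper name `make_bin_dummies` and the toy DataFrame are not part of the notebook), the pattern can be factored like this:

```python
import numpy as np
import pandas as pd

def make_bin_dummies(df, col, edges, labels):
    """Create one mutually exclusive 0/1 dummy per (edges[i], edges[i+1]] interval."""
    for lo, hi, label in zip(edges[:-1], edges[1:], labels):
        df[f'{col}:{label}'] = np.where((df[col] > lo) & (df[col] <= hi), 1, 0)
    return df

# Toy data using the same cut points chosen for 'rev_to_il_limit_ratio' above.
df = pd.DataFrame({'rev_to_il_limit_ratio': [0.3, 0.7, 1.5, 3.0, 7.0, 12.0]})
edges = [-np.inf, 0.6, 0.8, 1.8, 4.5, 10., np.inf]
labels = ['0-0.6', '0.6-0.8', '0.8-1.8', '1.8-4.5', '4.5-10', '>10.']
make_bin_dummies(df, 'rev_to_il_limit_ratio', edges, labels)
```

Because the intervals share boundaries and use strict `>` on the left with `<=` on the right, every value lands in exactly one dummy.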

Variable: 'total_il_high_credit_limit'¶

In [1340]:
# number of unique values
df_inputs_prepr['total_il_high_credit_limit'].nunique() 
Out[1340]:
94012
In [1341]:
# 'total_il_high_credit_limit'
# We keep only observations with 'total_il_high_credit_limit' less than or equal to 250,000.
df_inputs_prepr_temp = df_inputs_prepr.loc[df_inputs_prepr['total_il_high_credit_limit'] <= 250000., : ].copy()

# Here we do fine-classing: using the 'cut' method, we split the variable into 50 categories by its values.
df_inputs_prepr_temp['total_il_high_credit_limit_factor'] = pd.cut(df_inputs_prepr_temp['total_il_high_credit_limit'], 50)

# We calculate weight of evidence.
df_temp = woe_ordered_continuous(df_inputs_prepr_temp, 'total_il_high_credit_limit_factor', df_targets_prepr[df_inputs_prepr_temp.index])
df_temp
Out[1341]:
total_il_high_credit_limit_factor n_obs prop_good prop_n_obs n_good n_bad prop_n_good prop_n_bad WoE diff_prop_good diff_WoE IV
0 (-249.996, 4999.92] 36285 0.203087 0.132928 7369.0 28916.0 0.125820 0.134869 0.659021 NaN NaN 0.00276
1 (4999.92, 9999.84] 12562 0.230457 0.046020 2895.0 9667.0 0.049430 0.045089 0.740164 0.027370 0.081143 0.00276
2 (9999.84, 14999.76] 16870 0.220036 0.061802 3712.0 13158.0 0.063379 0.061371 0.709375 0.010421 0.030789 0.00276
3 (14999.76, 19999.68] 18793 0.212792 0.068847 3999.0 14794.0 0.068280 0.069002 0.687900 0.007244 0.021475 0.00276
4 (19999.68, 24999.6] 20033 0.219588 0.073390 4399.0 15634.0 0.075109 0.072920 0.708049 0.006796 0.020149 0.00276
5 (24999.6, 29999.52] 19362 0.224357 0.070931 4344.0 15018.0 0.074170 0.070047 0.722157 0.004769 0.014108 0.00276
6 (29999.52, 34999.44] 30728 0.189827 0.112570 5833.0 24895.0 0.099594 0.116115 0.619349 0.034530 0.102808 0.00276
7 (34999.44, 39999.36] 15447 0.224315 0.056589 3465.0 11982.0 0.059162 0.055886 0.722034 0.034489 0.102685 0.00276
8 (39999.36, 44999.28] 13698 0.219156 0.050182 3002.0 10696.0 0.051257 0.049888 0.706771 0.005159 0.015263 0.00276
9 (44999.28, 49999.2] 11906 0.227448 0.043617 2708.0 9198.0 0.046237 0.042901 0.731288 0.008292 0.024517 0.00276
10 (49999.2, 54999.12] 10240 0.223535 0.037514 2289.0 7951.0 0.039083 0.037085 0.719727 0.003913 0.011560 0.00276
11 (54999.12, 59999.04] 8855 0.223264 0.032440 1977.0 6878.0 0.033756 0.032080 0.718925 0.000271 0.000803 0.00276
12 (59999.04, 64998.96] 7636 0.220272 0.027974 1682.0 5954.0 0.028719 0.027771 0.710076 0.002991 0.008849 0.00276
13 (64998.96, 69998.88] 6648 0.213448 0.024355 1419.0 5229.0 0.024228 0.024389 0.689846 0.006825 0.020229 0.00276
14 (69998.88, 74998.8] 5637 0.221217 0.020651 1247.0 4390.0 0.021291 0.020476 0.712871 0.007769 0.023025 0.00276
15 (74998.8, 79998.72] 4902 0.214810 0.017958 1053.0 3849.0 0.017979 0.017952 0.693890 0.006407 0.018981 0.00276
16 (79998.72, 84998.64] 4278 0.218326 0.015672 934.0 3344.0 0.015947 0.015597 0.704313 0.003516 0.010423 0.00276
17 (84998.64, 89998.56] 3714 0.222940 0.013606 828.0 2886.0 0.014137 0.013461 0.717968 0.004614 0.013655 0.00276
18 (89998.56, 94998.48] 3089 0.220783 0.011316 682.0 2407.0 0.011645 0.011227 0.711588 0.002157 0.006380 0.00276
19 (94998.48, 99998.4] 2642 0.228993 0.009679 605.0 2037.0 0.010330 0.009501 0.735847 0.008210 0.024258 0.00276
20 (99998.4, 104998.32] 2358 0.224343 0.008638 529.0 1829.0 0.009032 0.008531 0.722114 0.004651 0.013732 0.00276
21 (104998.32, 109998.24] 2078 0.207411 0.007613 431.0 1647.0 0.007359 0.007682 0.671904 0.016932 0.050210 0.00276
22 (109998.24, 114998.16] 1749 0.205260 0.006407 359.0 1390.0 0.006130 0.006483 0.665499 0.002151 0.006404 0.00276
23 (114998.16, 119998.08] 1516 0.207784 0.005554 315.0 1201.0 0.005378 0.005602 0.673013 0.002523 0.007514 0.00276
24 (119998.08, 124998.0] 1332 0.209459 0.004880 279.0 1053.0 0.004764 0.004911 0.677998 0.001676 0.004985 0.00276
25 (124998.0, 129997.92] 1208 0.217715 0.004425 263.0 945.0 0.004491 0.004408 0.702503 0.008256 0.024505 0.00276
26 (129997.92, 134997.84] 1046 0.204589 0.003832 214.0 832.0 0.003654 0.003881 0.663499 0.013126 0.039003 0.00276
27 (134997.84, 139997.76] 922 0.233189 0.003378 215.0 707.0 0.003671 0.003298 0.748216 0.028600 0.084716 0.00276
28 (139997.76, 144997.68] 806 0.224566 0.002953 181.0 625.0 0.003090 0.002915 0.722774 0.008623 0.025442 0.00276
29 (144997.68, 149997.6] 740 0.195946 0.002711 145.0 595.0 0.002476 0.002775 0.637689 0.028620 0.085084 0.00276
30 (149997.6, 154997.52] 678 0.228614 0.002484 155.0 523.0 0.002646 0.002439 0.734727 0.032668 0.097037 0.00276
31 (154997.52, 159997.44] 579 0.231434 0.002121 134.0 445.0 0.002288 0.002076 0.743043 0.002820 0.008317 0.00276
32 (159997.44, 164997.36] 518 0.198842 0.001898 103.0 415.0 0.001759 0.001936 0.646349 0.032592 0.096694 0.00276
33 (164997.36, 169997.28] 465 0.212903 0.001703 99.0 366.0 0.001690 0.001707 0.688230 0.014062 0.041881 0.00276
34 (169997.28, 174997.2] 437 0.173913 0.001601 76.0 361.0 0.001298 0.001684 0.571360 0.038990 0.116870 0.00276
35 (174997.2, 179997.12] 398 0.211055 0.001458 84.0 314.0 0.001434 0.001465 0.682741 0.037142 0.111381 0.00276
36 (179997.12, 184997.04] 339 0.182891 0.001242 62.0 277.0 0.001059 0.001292 0.598486 0.028164 0.084255 0.00276
37 (184997.04, 189996.96] 358 0.217877 0.001312 78.0 280.0 0.001332 0.001306 0.702982 0.034986 0.104496 0.00276
38 (189996.96, 194996.88] 278 0.154676 0.001018 43.0 235.0 0.000734 0.001096 0.512722 0.063201 0.190260 0.00276
39 (194996.88, 199996.8] 278 0.241007 0.001018 67.0 211.0 0.001144 0.000984 0.771220 0.086331 0.258498 0.00276
40 (199996.8, 204996.72] 240 0.162500 0.000879 39.0 201.0 0.000666 0.000937 0.536660 0.078507 0.234560 0.00276
41 (204996.72, 209996.64] 184 0.184783 0.000674 34.0 150.0 0.000581 0.000700 0.604184 0.022283 0.067524 0.00276
42 (209996.64, 214996.56] 202 0.198020 0.000740 40.0 162.0 0.000683 0.000756 0.643892 0.013237 0.039708 0.00276
43 (214996.56, 219996.48] 178 0.213483 0.000652 38.0 140.0 0.000649 0.000653 0.689952 0.015463 0.046059 0.00276
44 (219996.48, 224996.4] 157 0.242038 0.000575 38.0 119.0 0.000649 0.000555 0.774249 0.028555 0.084298 0.00276
45 (224996.4, 229996.32] 123 0.186992 0.000451 23.0 100.0 0.000393 0.000466 0.610831 0.055046 0.163418 0.00276
46 (229996.32, 234996.24] 133 0.187970 0.000487 25.0 108.0 0.000427 0.000504 0.613771 0.000978 0.002940 0.00276
47 (234996.24, 239996.16] 123 0.138211 0.000451 17.0 106.0 0.000290 0.000494 0.461905 0.049759 0.151866 0.00276
48 (239996.16, 244996.08] 126 0.166667 0.000462 21.0 105.0 0.000359 0.000490 0.549358 0.028455 0.087453 0.00276
49 (244996.08, 249996.0] 94 0.202128 0.000344 19.0 75.0 0.000324 0.000350 0.656160 0.035461 0.106803 0.00276
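The tables above come from the notebook's `woe_ordered_continuous` helper, defined earlier. For reference, a minimal stand-alone sketch of the textbook WoE/IV computation (the function name, column set, and the "good = target 1" convention are assumptions here; the notebook's exact implementation may differ in detail):

```python
import numpy as np
import pandas as pd

def woe_table(categories, targets):
    """Per category: WoE = ln(prop_n_good / prop_n_bad); IV is the sum of
    (prop_n_good - prop_n_bad) * WoE over all categories.
    Here 'good' means target == 1 (an assumed convention)."""
    df = pd.DataFrame({'cat': categories, 'good': targets})
    grp = df.groupby('cat', observed=True)['good'].agg(n_obs='count', n_good='sum')
    grp['n_bad'] = grp['n_obs'] - grp['n_good']
    grp['prop_n_good'] = grp['n_good'] / grp['n_good'].sum()
    grp['prop_n_bad'] = grp['n_bad'] / grp['n_bad'].sum()
    grp['WoE'] = np.log(grp['prop_n_good'] / grp['prop_n_bad'])
    grp['IV'] = ((grp['prop_n_good'] - grp['prop_n_bad']) * grp['WoE']).sum()
    return grp.reset_index()

# Toy example: category 'A' has 2 good / 1 bad, 'B' has 1 good / 4 bad.
cats = pd.Series(['A', 'A', 'A', 'B', 'B', 'B', 'B', 'B'])
y = pd.Series([1, 1, 0, 1, 0, 0, 0, 0])
tbl = woe_table(cats, y)
```

Categories with similar WoE are then merged into the coarse classes used for the dummies, since a common WoE implies similar odds of being good.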
In [1342]:
plot_by_woe(df_temp, 90)
# We plot the weight of evidence values.
[Figure: weight of evidence across the 'total_il_high_credit_limit' fine classes]
In [1343]:
# Categories: '< 5k', '5 - 10k', '10 - 30k', '30 - 35k', '35 - 100k', '> 100k'
df_inputs_prepr['total_il_high_credit_limit:0-5k'] = np.where((df_inputs_prepr['total_il_high_credit_limit'] <= 5000.), 1, 0)
df_inputs_prepr['total_il_high_credit_limit:5-10k'] = np.where((df_inputs_prepr['total_il_high_credit_limit'] > 5000.) & (df_inputs_prepr['total_il_high_credit_limit'] <= 10000.), 1, 0)
df_inputs_prepr['total_il_high_credit_limit:10-30k'] = np.where((df_inputs_prepr['total_il_high_credit_limit'] > 10000.) & (df_inputs_prepr['total_il_high_credit_limit'] <= 30000.), 1, 0)
df_inputs_prepr['total_il_high_credit_limit:30-35k'] = np.where((df_inputs_prepr['total_il_high_credit_limit'] > 30000.) & (df_inputs_prepr['total_il_high_credit_limit'] <= 35000.), 1, 0)
df_inputs_prepr['total_il_high_credit_limit:35-100k'] = np.where((df_inputs_prepr['total_il_high_credit_limit'] > 35000.) & (df_inputs_prepr['total_il_high_credit_limit'] <= 100000.), 1, 0)
df_inputs_prepr['total_il_high_credit_limit:>100k'] = np.where((df_inputs_prepr['total_il_high_credit_limit'] > 100000.), 1, 0)

Variable: 'tot_cur_bal'¶

In [1344]:
# number of unique values
df_inputs_prepr['tot_cur_bal'].nunique() 
Out[1344]:
170012
In [1345]:
# 'tot_cur_bal'
# We keep only observations with 'tot_cur_bal' less than or equal to 500,000.
df_inputs_prepr_temp = df_inputs_prepr.loc[df_inputs_prepr['tot_cur_bal'] <= 500000., : ].copy()

# Here we do fine-classing: using the 'cut' method, we split the variable into 50 categories by its values.
df_inputs_prepr_temp['tot_cur_bal_factor'] = pd.cut(df_inputs_prepr_temp['tot_cur_bal'], 50)

# We calculate weight of evidence.
df_temp = woe_ordered_continuous(df_inputs_prepr_temp, 'tot_cur_bal_factor', df_targets_prepr[df_inputs_prepr_temp.index])
df_temp
Out[1345]:
tot_cur_bal_factor n_obs prop_good prop_n_obs n_good n_bad prop_n_good prop_n_bad WoE diff_prop_good diff_WoE IV
0 (-499.976, 9999.52] 16314 0.220792 0.061271 3602.0 12712.0 0.062425 0.060951 0.705166 NaN NaN 0.017143
1 (9999.52, 19999.04] 24906 0.236409 0.093540 5888.0 19018.0 0.102043 0.091187 0.750969 0.015617 0.045803 0.017143
2 (19999.04, 29998.56] 25333 0.249595 0.095143 6323.0 19010.0 0.109582 0.091149 0.789472 0.013186 0.038503 0.017143
3 (29998.56, 39998.08] 20724 0.253571 0.077833 5255.0 15469.0 0.091073 0.074171 0.801053 0.003975 0.011581 0.017143
4 (39998.08, 49997.6] 15744 0.254764 0.059130 4011.0 11733.0 0.069514 0.056257 0.804527 0.001193 0.003473 0.017143
5 (49997.6, 59997.12] 11657 0.259672 0.043780 3027.0 8630.0 0.052460 0.041379 0.818808 0.004909 0.014282 0.017143
6 (59997.12, 69996.64] 8558 0.253681 0.032141 2171.0 6387.0 0.037625 0.030624 0.801374 0.005992 0.017435 0.017143
7 (69996.64, 79996.16] 20487 0.186264 0.076943 3816.0 16671.0 0.066134 0.079934 0.602872 0.067416 0.198502 0.017143
8 (79996.16, 89995.68] 6084 0.236029 0.022850 1436.0 4648.0 0.024887 0.022286 0.749858 0.049764 0.146985 0.017143
9 (89995.68, 99995.2] 5461 0.232192 0.020510 1268.0 4193.0 0.021975 0.020105 0.738625 0.003837 0.011233 0.017143
10 (99995.2, 109994.72] 5111 0.221287 0.019195 1131.0 3980.0 0.019601 0.019083 0.706623 0.010904 0.032002 0.017143
11 (109994.72, 119994.24] 4733 0.213607 0.017776 1011.0 3722.0 0.017521 0.017846 0.684005 0.007681 0.022618 0.017143
12 (119994.24, 129993.76] 4956 0.215093 0.018613 1066.0 3890.0 0.018475 0.018652 0.688387 0.001486 0.004382 0.017143
13 (129993.76, 139993.28] 5013 0.204867 0.018827 1027.0 3986.0 0.017799 0.019112 0.658184 0.010225 0.030203 0.017143
14 (139993.28, 149992.8] 4984 0.200642 0.018718 1000.0 3984.0 0.017331 0.019102 0.645664 0.004225 0.012520 0.017143
15 (149992.8, 159992.32] 5130 0.203509 0.019267 1044.0 4086.0 0.018093 0.019591 0.654161 0.002867 0.008497 0.017143
16 (159992.32, 169991.84] 5065 0.184995 0.019023 937.0 4128.0 0.016239 0.019793 0.599079 0.018514 0.055082 0.017143
17 (169991.84, 179991.36] 4895 0.197549 0.018384 967.0 3928.0 0.016759 0.018834 0.636482 0.012553 0.037403 0.017143
18 (179991.36, 189990.88] 4659 0.191887 0.017498 894.0 3765.0 0.015494 0.018052 0.619642 0.005662 0.016840 0.017143
19 (189990.88, 199990.4] 4589 0.190673 0.017235 875.0 3714.0 0.015164 0.017808 0.616027 0.001213 0.003615 0.017143
20 (199990.4, 209989.92] 4359 0.183758 0.016371 801.0 3558.0 0.013882 0.017060 0.595379 0.006916 0.020648 0.017143
21 (209989.92, 219989.44] 4331 0.189333 0.016266 820.0 3511.0 0.014211 0.016834 0.612030 0.005575 0.016651 0.017143
22 (219989.44, 229988.96] 3853 0.193615 0.014471 746.0 3107.0 0.012929 0.014897 0.624789 0.004283 0.012759 0.017143
23 (229988.96, 239988.48] 3717 0.182136 0.013960 677.0 3040.0 0.011733 0.014576 0.590527 0.011479 0.034262 0.017143
24 (239988.48, 249988.0] 3583 0.177505 0.013457 636.0 2947.0 0.011022 0.014130 0.576644 0.004631 0.013883 0.017143
25 (249988.0, 259987.52] 3313 0.187142 0.012443 620.0 2693.0 0.010745 0.012912 0.605492 0.009637 0.028848 0.017143
26 (259987.52, 269987.04] 3240 0.178395 0.012169 578.0 2662.0 0.010017 0.012764 0.579315 0.008747 0.026177 0.017143
27 (269987.04, 279986.56] 2970 0.176431 0.011154 524.0 2446.0 0.009081 0.011728 0.573419 0.001964 0.005896 0.017143
28 (279986.56, 289986.08] 2807 0.178839 0.010542 502.0 2305.0 0.008700 0.011052 0.580645 0.002408 0.007226 0.017143
29 (289986.08, 299985.6] 2627 0.176247 0.009866 463.0 2164.0 0.008024 0.010376 0.572866 0.002592 0.007780 0.017143
30 (299985.6, 309985.12] 2487 0.177724 0.009340 442.0 2045.0 0.007660 0.009805 0.577302 0.001477 0.004436 0.017143
31 (309985.12, 319984.64] 2357 0.174374 0.008852 411.0 1946.0 0.007123 0.009331 0.567238 0.003350 0.010064 0.017143
32 (319984.64, 329984.16] 2105 0.163420 0.007906 344.0 1761.0 0.005962 0.008444 0.534192 0.010954 0.033047 0.017143
33 (329984.16, 339983.68] 1936 0.169421 0.007271 328.0 1608.0 0.005684 0.007710 0.552324 0.006001 0.018132 0.017143
34 (339983.68, 349983.2] 1862 0.178840 0.006993 333.0 1529.0 0.005771 0.007331 0.580649 0.009418 0.028326 0.017143
35 (349983.2, 359982.72] 1731 0.160601 0.006501 278.0 1453.0 0.004818 0.006967 0.525648 0.018239 0.055001 0.017143
36 (359982.72, 369982.24] 1626 0.170357 0.006107 277.0 1349.0 0.004801 0.006468 0.555143 0.009756 0.029495 0.017143
37 (369982.24, 379981.76] 1472 0.163043 0.005528 240.0 1232.0 0.004159 0.005907 0.533050 0.007313 0.022093 0.017143
38 (379981.76, 389981.28] 1365 0.170696 0.005127 233.0 1132.0 0.004038 0.005428 0.556166 0.007652 0.023116 0.017143
39 (389981.28, 399980.8] 1335 0.168539 0.005014 225.0 1110.0 0.003899 0.005322 0.549662 0.002157 0.006503 0.017143
40 (399980.8, 409980.32] 1245 0.167871 0.004676 209.0 1036.0 0.003622 0.004967 0.547647 0.000668 0.002016 0.017143
41 (409980.32, 419979.84] 1136 0.161092 0.004266 183.0 953.0 0.003172 0.004569 0.527136 0.006780 0.020510 0.017143
42 (419979.84, 429979.36] 1030 0.170874 0.003868 176.0 854.0 0.003050 0.004095 0.556702 0.009782 0.029565 0.017143
43 (429979.36, 439978.88] 979 0.164454 0.003677 161.0 818.0 0.002790 0.003922 0.537318 0.006420 0.019384 0.017143
44 (439978.88, 449978.4] 909 0.155116 0.003414 141.0 768.0 0.002444 0.003682 0.508983 0.009338 0.028335 0.017143
45 (449978.4, 459977.92] 828 0.166667 0.003110 138.0 690.0 0.002392 0.003308 0.544008 0.011551 0.035025 0.017143
46 (459977.92, 469977.44] 757 0.163804 0.002843 124.0 633.0 0.002149 0.003035 0.535354 0.002862 0.008654 0.017143
47 (469977.44, 479976.96] 674 0.173591 0.002531 117.0 557.0 0.002028 0.002671 0.564881 0.009786 0.029527 0.017143
48 (479976.96, 489976.48] 617 0.176661 0.002317 109.0 508.0 0.001889 0.002436 0.574111 0.003071 0.009230 0.017143
49 (489976.48, 499976.0] 607 0.191104 0.002280 116.0 491.0 0.002010 0.002354 0.617310 0.014443 0.043199 0.017143
In [1346]:
plot_by_woe(df_temp, 90)
# We plot the weight of evidence values.
[Figure: weight of evidence across the 'tot_cur_bal' fine classes]
In [1347]:
# Categories: '< 20k', '20 - 70k', '70 - 80k', '80 - 130k', '130 - 200k', '200 - 250k', '250 - 500k', '> 500k'
df_inputs_prepr['tot_cur_bal:0-20k'] = np.where((df_inputs_prepr['tot_cur_bal'] <= 20000.), 1, 0)
df_inputs_prepr['tot_cur_bal:20-70k'] = np.where((df_inputs_prepr['tot_cur_bal'] > 20000.) & (df_inputs_prepr['tot_cur_bal'] <= 70000.), 1, 0)
df_inputs_prepr['tot_cur_bal:70-80k'] = np.where((df_inputs_prepr['tot_cur_bal'] > 70000.) & (df_inputs_prepr['tot_cur_bal'] <= 80000.), 1, 0)
df_inputs_prepr['tot_cur_bal:80-130k'] = np.where((df_inputs_prepr['tot_cur_bal'] > 80000.) & (df_inputs_prepr['tot_cur_bal'] <= 130000.), 1, 0)
df_inputs_prepr['tot_cur_bal:130-200k'] = np.where((df_inputs_prepr['tot_cur_bal'] > 130000.) & (df_inputs_prepr['tot_cur_bal'] <= 200000.), 1, 0)
df_inputs_prepr['tot_cur_bal:200-250k'] = np.where((df_inputs_prepr['tot_cur_bal'] > 200000.) & (df_inputs_prepr['tot_cur_bal'] <= 250000.), 1, 0)
df_inputs_prepr['tot_cur_bal:250-500k'] = np.where((df_inputs_prepr['tot_cur_bal'] > 250000.) & (df_inputs_prepr['tot_cur_bal'] <= 500000.), 1, 0)
df_inputs_prepr['tot_cur_bal:>500k'] = np.where((df_inputs_prepr['tot_cur_bal'] > 500000.), 1, 0)
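With eight hand-written dummies covering the whole range, it is easy to leave a gap or overlap at a boundary. A quick sanity check (a hypothetical addition, not in the original notebook; the stand-in DataFrame deliberately includes the exact cut points) confirms every row falls in exactly one bin:

```python
import numpy as np
import pandas as pd

# Stand-in for df_inputs_prepr, including values at and just past each boundary.
df = pd.DataFrame({'tot_cur_bal': [0., 20000., 20000.01, 70000., 80000.,
                                   130000., 200000., 250000., 500000., 500000.01]})

bins = {                       # (low, high] intervals, same cut points as above
    '0-20k':    (-np.inf, 20000.),
    '20-70k':   (20000., 70000.),
    '70-80k':   (70000., 80000.),
    '80-130k':  (80000., 130000.),
    '130-200k': (130000., 200000.),
    '200-250k': (200000., 250000.),
    '250-500k': (250000., 500000.),
    '>500k':    (500000., np.inf),
}
for label, (lo, hi) in bins.items():
    df[f'tot_cur_bal:{label}'] = np.where(
        (df['tot_cur_bal'] > lo) & (df['tot_cur_bal'] <= hi), 1, 0)

dummy_cols = [c for c in df.columns if c.startswith('tot_cur_bal:')]
assert (df[dummy_cols].sum(axis=1) == 1).all(), "bins must partition the range"
```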

Variable: 'open_act_il'¶

In [1348]:
# unique values
df_inputs_prepr['open_act_il'].unique() 
Out[1348]:
array([ 0.,  4.,  1.,  5.,  2.,  3.,  8.,  9.,  7., 12.,  6., 10., 23.,
       14., 15., 19., 16., 11., 18., 21., 13., 17., 22., 31., 27., 30.,
       20., 25., 26., 24., 32., 35., 42., 29., 28., 53., 36., 45., 40.,
       34., 37., 33.])
In [1349]:
# 'open_act_il'
df_temp = woe_ordered_continuous(df_inputs_prepr, 'open_act_il', df_targets_prepr)

# We calculate weight of evidence.
df_temp
Out[1349]:
open_act_il n_obs prop_good prop_n_obs n_good n_bad prop_n_good prop_n_bad WoE diff_prop_good diff_WoE IV
0 0.0 173897 0.190676 0.634119 33158.0 140739.0 0.563892 0.653287 0.622275 NaN NaN inf
1 1.0 28616 0.247030 0.104349 7069.0 21547.0 0.120217 0.100018 0.789347 0.056354 0.167072 inf
2 2.0 27842 0.254939 0.101526 7098.0 20744.0 0.120710 0.096290 0.812532 0.007909 0.023185 inf
3 3.0 17470 0.258558 0.063705 4517.0 12953.0 0.076817 0.060126 0.823126 0.003619 0.010594 inf
4 4.0 9205 0.269093 0.033566 2477.0 6728.0 0.042124 0.031230 0.853919 0.010535 0.030793 inf
5 5.0 5079 0.273479 0.018521 1389.0 3690.0 0.023622 0.017128 0.866720 0.004386 0.012801 inf
6 6.0 2911 0.263483 0.010615 767.0 2144.0 0.013044 0.009952 0.837531 0.009996 0.029188 inf
7 7.0 2033 0.250861 0.007413 510.0 1523.0 0.008673 0.007070 0.800584 0.012623 0.036947 inf
8 8.0 1454 0.240028 0.005302 349.0 1105.0 0.005935 0.005129 0.768778 0.010833 0.031807 inf
9 9.0 1148 0.234321 0.004186 269.0 879.0 0.004575 0.004080 0.751980 0.005707 0.016797 inf
10 10.0 948 0.247890 0.003457 235.0 713.0 0.003996 0.003310 0.791872 0.013570 0.039892 inf
11 11.0 783 0.246488 0.002855 193.0 590.0 0.003282 0.002739 0.787757 0.001402 0.004115 inf
12 12.0 597 0.251256 0.002177 150.0 447.0 0.002551 0.002075 0.801743 0.004768 0.013987 inf
13 13.0 461 0.271150 0.001681 125.0 336.0 0.002126 0.001560 0.859923 0.019893 0.058179 inf
14 14.0 361 0.293629 0.001316 106.0 255.0 0.001803 0.001184 0.925426 0.022479 0.065504 inf
15 15.0 330 0.287879 0.001203 95.0 235.0 0.001616 0.001091 0.908688 0.005750 0.016739 inf
16 16.0 210 0.204762 0.000766 43.0 167.0 0.000731 0.000775 0.664410 0.083117 0.244277 inf
17 17.0 210 0.261905 0.000766 55.0 155.0 0.000935 0.000719 0.832917 0.057143 0.168506 inf
18 18.0 160 0.318750 0.000583 51.0 109.0 0.000867 0.000506 0.998498 0.056845 0.165581 inf
19 19.0 110 0.263636 0.000401 29.0 81.0 0.000493 0.000376 0.837979 0.055114 0.160519 inf
20 20.0 91 0.219780 0.000332 20.0 71.0 0.000340 0.000330 0.709032 0.043856 0.128946 inf
21 21.0 80 0.287500 0.000292 23.0 57.0 0.000391 0.000265 0.907585 0.067720 0.198552 inf
22 22.0 54 0.277778 0.000197 15.0 39.0 0.000255 0.000181 0.879257 0.009722 0.028327 inf
23 23.0 43 0.348837 0.000157 15.0 28.0 0.000255 0.000130 1.086097 0.071059 0.206840 inf
24 24.0 39 0.410256 0.000142 16.0 23.0 0.000272 0.000107 1.266567 0.061419 0.180470 inf
25 25.0 21 0.142857 0.000077 3.0 18.0 0.000051 0.000084 0.476616 0.267399 0.789952 inf
26 26.0 23 0.217391 0.000084 5.0 18.0 0.000085 0.000084 0.701953 0.074534 0.225338 inf
27 27.0 15 0.400000 0.000055 6.0 9.0 0.000102 0.000042 1.236185 0.182609 0.534232 inf
28 28.0 7 0.285714 0.000026 2.0 5.0 0.000034 0.000023 0.902384 0.114286 0.333801 inf
29 29.0 10 0.300000 0.000036 3.0 7.0 0.000051 0.000032 0.943965 0.014286 0.041580 inf
30 30.0 7 0.285714 0.000026 2.0 5.0 0.000034 0.000023 0.902384 0.014286 0.041580 inf
31 31.0 6 0.166667 0.000022 1.0 5.0 0.000017 0.000023 0.549702 0.119048 0.352682 inf
32 32.0 3 0.333333 0.000011 1.0 2.0 0.000017 0.000009 1.040928 0.166667 0.491225 inf
33 33.0 1 1.000000 0.000004 1.0 0.0 0.000017 0.000000 inf 0.666667 inf inf
34 34.0 1 1.000000 0.000004 1.0 0.0 0.000017 0.000000 inf 0.000000 NaN inf
35 35.0 1 1.000000 0.000004 1.0 0.0 0.000017 0.000000 inf 0.000000 NaN inf
36 36.0 2 0.500000 0.000007 1.0 1.0 0.000017 0.000005 1.539806 0.500000 inf inf
37 37.0 1 0.000000 0.000004 0.0 1.0 0.000000 0.000005 0.000000 0.500000 1.539806 inf
38 40.0 1 1.000000 0.000004 1.0 0.0 0.000017 0.000000 inf 1.000000 inf inf
39 42.0 1 0.000000 0.000004 0.0 1.0 0.000000 0.000005 0.000000 1.000000 inf inf
40 45.0 1 0.000000 0.000004 0.0 1.0 0.000000 0.000005 0.000000 0.000000 0.000000 inf
41 53.0 1 0.000000 0.000004 0.0 1.0 0.000000 0.000005 0.000000 0.000000 0.000000 inf
In [1350]:
plot_by_woe(df_temp, 90)
# We plot the weight of evidence values.
[Figure: weight of evidence by number of active installment accounts ('open_act_il')]
In [1351]:
# Categories: '0', '1-5', '6-15', '>=16'
df_inputs_prepr['open_act_il:0'] = np.where((df_inputs_prepr['open_act_il'] == 0), 1, 0)
df_inputs_prepr['open_act_il:1-5'] = np.where((df_inputs_prepr['open_act_il'] >= 1) & (df_inputs_prepr['open_act_il'] <= 5), 1, 0)
df_inputs_prepr['open_act_il:6-15'] = np.where((df_inputs_prepr['open_act_il'] >= 6) & (df_inputs_prepr['open_act_il'] <= 15), 1, 0)
df_inputs_prepr['open_act_il:>=16'] = np.where((df_inputs_prepr['open_act_il'] >= 16), 1, 0)
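The 'open_act_il' table shows `WoE = inf` and `IV = inf` for sparse values where one class has zero observations, which is one motivation for folding the tail into the broad '>=16' category above. A common alternative, sketched here as an assumption rather than something the notebook does, is additive smoothing of the counts:

```python
import numpy as np

def woe_smoothed(n_good, n_bad, eps=0.5):
    """WoE = ln(prop_good / prop_bad) with add-eps smoothing so that
    zero-count cells stay finite instead of producing +/- inf."""
    n_good = np.asarray(n_good, dtype=float) + eps
    n_bad = np.asarray(n_bad, dtype=float) + eps
    prop_good = n_good / n_good.sum()
    prop_bad = n_bad / n_bad.sum()
    return np.log(prop_good / prop_bad)

# A bin with zero bads no longer produces an infinite WoE:
woe = woe_smoothed([1, 100], [0, 300])
```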

Variable: 'open_il_12m'¶

In [1352]:
# unique values
df_inputs_prepr['open_il_12m'].unique() 
Out[1352]:
array([ 0.,  4.,  1.,  3.,  2.,  5.,  6.,  8.,  9.,  7., 11., 10., 20.,
       13., 15.])
In [1353]:
# 'open_il_12m'
df_temp = woe_ordered_continuous(df_inputs_prepr, 'open_il_12m', df_targets_prepr)

# We calculate weight of evidence.
df_temp
Out[1353]:
open_il_12m n_obs prop_good prop_n_obs n_good n_bad prop_n_good prop_n_bad WoE diff_prop_good diff_WoE IV
0 0.0 217562 0.200306 0.793344 43579.0 173983.0 0.741114 0.807601 0.651113 NaN NaN 0.014419
1 1.0 35641 0.257681 0.129966 9184.0 26457.0 0.156185 0.122809 0.820560 0.057375 0.169447 0.014419
2 2.0 14460 0.279391 0.052729 4040.0 10420.0 0.068705 0.048368 0.883961 0.021711 0.063401 0.014419
3 3.0 4417 0.304279 0.016107 1344.0 3073.0 0.022856 0.014264 0.956411 0.024887 0.072450 0.014419
4 4.0 1453 0.307639 0.005298 447.0 1006.0 0.007602 0.004670 0.966185 0.003360 0.009774 0.014419
5 5.0 474 0.291139 0.001728 138.0 336.0 0.002347 0.001560 0.918180 0.016500 0.048005 0.014419
6 6.0 153 0.307190 0.000558 47.0 106.0 0.000799 0.000492 0.964877 0.016050 0.046697 0.014419
7 7.0 41 0.219512 0.000150 9.0 32.0 0.000153 0.000149 0.708238 0.087677 0.256638 0.014419
8 8.0 14 0.428571 0.000051 6.0 8.0 0.000102 0.000037 1.321159 0.209059 0.612921 0.014419
9 9.0 8 0.375000 0.000029 3.0 5.0 0.000051 0.000023 1.162592 0.053571 0.158568 0.014419
10 10.0 4 0.750000 0.000015 3.0 1.0 0.000051 0.000005 2.484161 0.375000 1.321569 0.014419
11 11.0 4 0.500000 0.000015 2.0 2.0 0.000034 0.000009 1.539806 0.250000 0.944355 0.014419
12 13.0 1 0.000000 0.000004 0.0 1.0 0.000000 0.000005 0.000000 0.500000 1.539806 0.014419
13 15.0 1 0.000000 0.000004 0.0 1.0 0.000000 0.000005 0.000000 0.000000 0.000000 0.014419
14 20.0 1 0.000000 0.000004 0.0 1.0 0.000000 0.000005 0.000000 0.000000 0.000000 0.014419
In [1354]:
plot_by_woe(df_temp, 90)
# We plot the weight of evidence values.
[Figure: weight of evidence by number of installment accounts opened in the past 12 months ('open_il_12m')]
In [1355]:
# Categories: '0', '1-5', '>=6'
df_inputs_prepr['open_il_12m:0'] = np.where((df_inputs_prepr['open_il_12m'] == 0), 1, 0)
df_inputs_prepr['open_il_12m:1-5'] = np.where((df_inputs_prepr['open_il_12m'] >= 1) & (df_inputs_prepr['open_il_12m'] <= 5), 1, 0)
df_inputs_prepr['open_il_12m:>=6'] = np.where((df_inputs_prepr['open_il_12m'] >= 6), 1, 0)

Variable: 'num_actv_rev_tl'¶

In [1356]:
# unique values
df_inputs_prepr['num_actv_rev_tl'].unique() 
Out[1356]:
array([11.,  8.,  5.,  4.,  7., 12.,  0., 13.,  6., 10.,  2.,  3.,  9.,
        1., 20., 14., 23., 16., 19., 15., 17., 25., 22., 18., 24., 26.,
       21., 29., 28., 30., 32., 42., 27., 31., 33., 34., 39., 36., 43.,
       38., 40., 35., 37., 52.])
In [1357]:
# 'num_actv_rev_tl'
df_temp = woe_ordered_continuous(df_inputs_prepr, 'num_actv_rev_tl', df_targets_prepr)

# We calculate weight of evidence.
df_temp
Out[1357]:
num_actv_rev_tl n_obs prop_good prop_n_obs n_good n_bad prop_n_good prop_n_bad WoE diff_prop_good diff_WoE IV
0 0.0 14746 0.160179 0.053772 2362.0 12384.0 0.040169 0.057484 0.529907 NaN NaN 0.017707
1 1.0 9100 0.181429 0.033183 1651.0 7449.0 0.028077 0.034577 0.594443 0.021250 0.064536 0.017707
2 2.0 24449 0.182380 0.089154 4459.0 19990.0 0.075831 0.092790 0.597312 0.000951 0.002869 0.017707
3 3.0 36493 0.192256 0.133072 7016.0 29477.0 0.119316 0.136827 0.627016 0.009876 0.029704 0.017707
4 4.0 40005 0.200850 0.145879 8035.0 31970.0 0.136645 0.148399 0.652737 0.008594 0.025722 0.017707
5 5.0 36835 0.208497 0.134320 7680.0 29155.0 0.130608 0.135333 0.675536 0.007647 0.022799 0.017707
6 6.0 30520 0.225557 0.111292 6884.0 23636.0 0.117071 0.109714 0.726123 0.017060 0.050586 0.017707
7 7.0 23382 0.229151 0.085263 5358.0 18024.0 0.091119 0.083664 0.736736 0.003594 0.010613 0.017707
8 8.0 17223 0.242408 0.062804 4175.0 13048.0 0.071001 0.060567 0.775776 0.013258 0.039041 0.017707
9 9.0 12283 0.249613 0.044790 3066.0 9217.0 0.052141 0.042784 0.796926 0.007205 0.021150 0.017707
10 10.0 8827 0.254220 0.032188 2244.0 6583.0 0.038162 0.030557 0.810428 0.004607 0.013501 0.017707
11 11.0 6004 0.282811 0.021894 1698.0 4306.0 0.028877 0.019988 0.893928 0.028591 0.083500 0.017707
12 12.0 4144 0.271477 0.015111 1125.0 3019.0 0.019132 0.014014 0.860878 0.011335 0.033050 0.017707
13 13.0 2961 0.285714 0.010797 846.0 2115.0 0.014387 0.009817 0.902384 0.014237 0.041507 0.017707
14 14.0 2012 0.296223 0.007337 596.0 1416.0 0.010136 0.006573 0.932975 0.010508 0.030591 0.017707
15 15.0 1506 0.291501 0.005492 439.0 1067.0 0.007466 0.004953 0.919232 0.004722 0.013742 0.017707
16 16.0 1100 0.307273 0.004011 338.0 762.0 0.005748 0.003537 0.965119 0.015772 0.045887 0.017707
17 17.0 749 0.308411 0.002731 231.0 518.0 0.003928 0.002404 0.968430 0.001138 0.003311 0.017707
18 18.0 482 0.336100 0.001758 162.0 320.0 0.002755 0.001485 1.048981 0.027688 0.080551 0.017707
19 19.0 376 0.260638 0.001371 98.0 278.0 0.001667 0.001290 0.829213 0.075461 0.219768 0.017707
20 20.0 272 0.294118 0.000992 80.0 192.0 0.001360 0.000891 0.926849 0.033479 0.097636 0.017707
21 21.0 191 0.335079 0.000696 64.0 127.0 0.001088 0.000590 1.046008 0.040961 0.119159 0.017707
22 22.0 147 0.299320 0.000536 44.0 103.0 0.000748 0.000478 0.941985 0.035759 0.104023 0.017707
23 23.0 112 0.401786 0.000408 45.0 67.0 0.000765 0.000311 1.241466 0.102466 0.299481 0.017707
24 24.0 75 0.386667 0.000273 29.0 46.0 0.000493 0.000214 1.196862 0.015119 0.044604 0.017707
25 25.0 57 0.298246 0.000208 17.0 40.0 0.000289 0.000186 0.938861 0.088421 0.258001 0.017707
26 26.0 45 0.400000 0.000164 18.0 27.0 0.000306 0.000125 1.236185 0.101754 0.297325 0.017707
27 27.0 36 0.250000 0.000131 9.0 27.0 0.000153 0.000125 0.798060 0.150000 0.438125 0.017707
28 28.0 23 0.434783 0.000084 10.0 13.0 0.000170 0.000060 1.339784 0.184783 0.541724 0.017707
29 29.0 24 0.250000 0.000088 6.0 18.0 0.000102 0.000084 0.798060 0.184783 0.541724 0.017707
30 30.0 13 0.230769 0.000047 3.0 10.0 0.000051 0.000046 0.741511 0.019231 0.056549 0.017707
31 31.0 11 0.363636 0.000040 4.0 7.0 0.000068 0.000032 1.129314 0.132867 0.387803 0.017707
32 32.0 7 0.285714 0.000026 2.0 5.0 0.000034 0.000023 0.902384 0.077922 0.226930 0.017707
33 33.0 3 0.333333 0.000011 1.0 2.0 0.000017 0.000009 1.040928 0.047619 0.138543 0.017707
34 34.0 4 0.500000 0.000015 2.0 2.0 0.000034 0.000009 1.539806 0.166667 0.498878 0.017707
35 35.0 1 0.000000 0.000004 0.0 1.0 0.000000 0.000005 0.000000 0.500000 1.539806 0.017707
36 36.0 4 0.500000 0.000015 2.0 2.0 0.000034 0.000009 1.539806 0.500000 1.539806 0.017707
37 37.0 1 0.000000 0.000004 0.0 1.0 0.000000 0.000005 0.000000 0.500000 1.539806 0.017707
38 38.0 2 0.500000 0.000007 1.0 1.0 0.000017 0.000005 1.539806 0.500000 1.539806 0.017707
39 39.0 3 0.000000 0.000011 0.0 3.0 0.000000 0.000014 0.000000 0.500000 1.539806 0.017707
40 40.0 2 0.500000 0.000007 1.0 1.0 0.000017 0.000005 1.539806 0.500000 1.539806 0.017707
41 42.0 2 0.500000 0.000007 1.0 1.0 0.000017 0.000005 1.539806 0.000000 0.000000 0.017707
42 43.0 1 0.000000 0.000004 0.0 1.0 0.000000 0.000005 0.000000 0.500000 1.539806 0.017707
43 52.0 1 0.000000 0.000004 0.0 1.0 0.000000 0.000005 0.000000 0.000000 0.000000 0.017707
In [1358]:
plot_by_woe(df_temp, 90)
# We plot the weight of evidence values.
[Figure: weight of evidence by 'num_actv_rev_tl' category]
In [1359]:
# We create the following categories: '0', '1-5', '6-9', '10-13', '14-17', '18-26', '>=27' 
# '>=27' will be the reference category
df_inputs_prepr['num_actv_rev_tl:0'] = np.where(df_inputs_prepr['num_actv_rev_tl'].isin([0]), 1, 0)
df_inputs_prepr['num_actv_rev_tl:1-5'] = np.where(df_inputs_prepr['num_actv_rev_tl'].isin(range(1, 6)), 1, 0)
df_inputs_prepr['num_actv_rev_tl:6-9'] = np.where(df_inputs_prepr['num_actv_rev_tl'].isin(range(6, 10)), 1, 0)
df_inputs_prepr['num_actv_rev_tl:10-13'] = np.where(df_inputs_prepr['num_actv_rev_tl'].isin(range(10, 14)), 1, 0)
df_inputs_prepr['num_actv_rev_tl:14-17'] = np.where(df_inputs_prepr['num_actv_rev_tl'].isin(range(14, 18)), 1, 0)
df_inputs_prepr['num_actv_rev_tl:18-26'] = np.where(df_inputs_prepr['num_actv_rev_tl'].isin(range(18, 27)), 1, 0)
df_inputs_prepr['num_actv_rev_tl:>=27'] = np.where(df_inputs_prepr['num_actv_rev_tl'].isin(range(27, 500)), 1, 0)
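The same pattern repeats for every coarse-classed variable: one `np.where(...isin(...))` line per category, plus an open-ended reference category. A small helper can generate these dummies and makes off-by-one range errors harder to introduce. The function name, signature, and `bins` structure below are illustrative, not part of the original notebook:

```python
import numpy as np
import pandas as pd

def add_interval_dummies(df, col, bins, open_ended_start):
    """Create 0/1 dummy columns for a fine-classed variable.

    `bins` is a list of (label, values) pairs covering the closed
    categories; `open_ended_start` produces the '>=k' category.
    """
    out = df.copy()
    for label, values in bins:
        out[f'{col}:{label}'] = np.where(out[col].isin(values), 1, 0)
    # Open-ended top category, e.g. 'num_actv_rev_tl:>=27'.
    out[f'{col}:>={open_ended_start}'] = np.where(out[col] >= open_ended_start, 1, 0)
    return out
```

For example, `add_interval_dummies(df_inputs_prepr, 'num_actv_rev_tl', [('0', [0]), ('1-5', range(1, 6)), ('6-9', range(6, 10)), ('10-13', range(10, 14)), ('14-17', range(14, 18)), ('18-26', range(18, 27))], 27)` would reproduce the seven columns created above.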

Variable: 'open_rv_12m'¶

In [1360]:
# unique values
df_inputs_prepr['open_rv_12m'].unique() 
Out[1360]:
array([ 0.,  2.,  3.,  1.,  5.,  6.,  4., 10.,  7.,  8.,  9., 11., 12.,
       14., 13., 15., 16., 18., 28., 22., 21., 17.])
In [1361]:
# 'open_rv_12m'
df_temp = woe_ordered_continuous(df_inputs_prepr, 'open_rv_12m', df_targets_prepr)

# We calculate weight of evidence.
df_temp
Out[1361]:
open_rv_12m n_obs prop_good prop_n_obs n_good n_bad prop_n_good prop_n_bad WoE diff_prop_good diff_WoE IV
0 0.0 198437 0.193432 0.723605 38384.0 160053.0 0.652767 0.742940 0.630541 NaN NaN inf
1 1.0 33646 0.248053 0.122691 8346.0 25300.0 0.141934 0.117438 0.792350 0.054622 0.161809 inf
2 2.0 20841 0.271292 0.075997 5654.0 15187.0 0.096153 0.070496 0.860339 0.023239 0.067988 inf
3 3.0 10931 0.290824 0.039860 3179.0 7752.0 0.054063 0.035984 0.917263 0.019532 0.056925 inf
4 4.0 5258 0.304869 0.019173 1603.0 3655.0 0.027261 0.016966 0.958127 0.014045 0.040864 inf
5 5.0 2522 0.307692 0.009197 776.0 1746.0 0.013197 0.008105 0.966339 0.002824 0.008212 inf
6 6.0 1247 0.331997 0.004547 414.0 833.0 0.007041 0.003867 1.037037 0.024304 0.070698 inf
7 7.0 620 0.333871 0.002261 207.0 413.0 0.003520 0.001917 1.042493 0.001874 0.005455 inf
8 8.0 321 0.295950 0.001171 95.0 226.0 0.001616 0.001049 0.932182 0.037921 0.110311 inf
9 9.0 161 0.310559 0.000587 50.0 111.0 0.000850 0.000515 0.974676 0.014609 0.042494 inf
10 10.0 112 0.410714 0.000408 46.0 66.0 0.000782 0.000306 1.267927 0.100155 0.293251 inf
11 11.0 47 0.340426 0.000171 16.0 31.0 0.000272 0.000144 1.061580 0.070289 0.206347 inf
12 12.0 32 0.437500 0.000117 14.0 18.0 0.000238 0.000084 1.347952 0.097074 0.286372 inf
13 13.0 21 0.285714 0.000077 6.0 15.0 0.000102 0.000070 0.902384 0.151786 0.445568 inf
14 14.0 13 0.384615 0.000047 5.0 8.0 0.000085 0.000037 1.190828 0.098901 0.288444 inf
15 15.0 12 0.083333 0.000044 1.0 11.0 0.000017 0.000051 0.287479 0.301282 0.903349 inf
16 16.0 6 0.500000 0.000022 3.0 3.0 0.000051 0.000014 1.539806 0.416667 1.252327 inf
17 17.0 1 0.000000 0.000004 0.0 1.0 0.000000 0.000005 0.000000 0.500000 1.539806 inf
18 18.0 3 0.666667 0.000011 2.0 1.0 0.000034 0.000005 2.119548 0.666667 2.119548 inf
19 21.0 1 1.000000 0.000004 1.0 0.0 0.000017 0.000000 inf 0.333333 inf inf
20 22.0 1 0.000000 0.000004 0.0 1.0 0.000000 0.000005 0.000000 1.000000 inf inf
21 28.0 1 0.000000 0.000004 0.0 1.0 0.000000 0.000005 0.000000 0.000000 0.000000 inf
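The `IV` column reads `inf` here because some sparse bins (e.g. `open_rv_12m` = 21.0 with zero bad loans, or 22.0 with zero good loans) make the log-odds term infinite. The implementation of `woe_ordered_continuous` is not shown in this part of the notebook, so the sketch below uses the textbook definitions (WoE as the log ratio of the good and bad distributions, IV as their weighted sum); the notebook's exact sign convention or smoothing may differ:

```python
import numpy as np
import pandas as pd

def woe_iv_table(x, y):
    """Textbook WoE/IV per distinct value of x against a binary target y
    (1 = good). A sketch only; not the notebook's woe_ordered_continuous."""
    df = pd.DataFrame({'x': x, 'y': y})
    grp = df.groupby('x')['y'].agg(n_obs='count', n_good='sum')
    grp['n_bad'] = grp['n_obs'] - grp['n_good']
    grp['prop_n_good'] = grp['n_good'] / grp['n_good'].sum()
    grp['prop_n_bad'] = grp['n_bad'] / grp['n_bad'].sum()
    # A bin with zero goods or zero bads makes WoE (and hence IV) infinite,
    # which is exactly why sparse tail bins get merged into wider categories.
    with np.errstate(divide='ignore'):
        grp['WoE'] = np.log(grp['prop_n_good'] / grp['prop_n_bad'])
    grp['IV'] = ((grp['prop_n_good'] - grp['prop_n_bad']) * grp['WoE']).sum()
    return grp.reset_index()
```

Merging such bins into broader categories (as done below) keeps every category populated with both goods and bads, so WoE and IV stay finite.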
In [1362]:
plot_by_woe(df_temp, 90)
# We plot the weight of evidence values.
[Figure: weight of evidence by 'open_rv_12m' category]
In [1363]:
# We create the following categories: '0', '1-2', '3-5', '6-8', '9-13', '>=14' 
# '>=14' will be the reference category
df_inputs_prepr['open_rv_12m:0'] = np.where(df_inputs_prepr['open_rv_12m'].isin([0]), 1, 0)
df_inputs_prepr['open_rv_12m:1-2'] = np.where(df_inputs_prepr['open_rv_12m'].isin(range(1, 3)), 1, 0)
df_inputs_prepr['open_rv_12m:3-5'] = np.where(df_inputs_prepr['open_rv_12m'].isin(range(3, 6)), 1, 0)
df_inputs_prepr['open_rv_12m:6-8'] = np.where(df_inputs_prepr['open_rv_12m'].isin(range(6, 9)), 1, 0)
df_inputs_prepr['open_rv_12m:9-13'] = np.where(df_inputs_prepr['open_rv_12m'].isin(range(9, 14)), 1, 0)
df_inputs_prepr['open_rv_12m:>=14'] = np.where(df_inputs_prepr['open_rv_12m'].isin(range(14, 500)), 1, 0)

Variable: 'num_bc_tl'¶

In [1364]:
# unique values
df_inputs_prepr['num_bc_tl'].unique() 
Out[1364]:
array([14.,  8.,  3.,  5., 11.,  6., 18.,  0.,  4., 13.,  7.,  9., 19.,
       10., 15., 12., 16.,  2., 17.,  1., 37., 22., 23., 29., 26., 27.,
       20., 28., 21., 33., 36., 24., 49., 34., 25., 35., 31., 32., 38.,
       39., 30., 42., 47., 41., 44., 66., 54., 43., 40., 53., 45., 51.,
       46., 48., 56., 61., 60.])
In [1365]:
# 'num_bc_tl'
df_temp = woe_ordered_continuous(df_inputs_prepr, 'num_bc_tl', df_targets_prepr)

# We calculate weight of evidence.
df_temp
Out[1365]:
num_bc_tl n_obs prop_good prop_n_obs n_good n_bad prop_n_good prop_n_bad WoE diff_prop_good diff_WoE IV
0 0.0 14229 0.162345 0.051886 2310.0 11919.0 0.039284 0.055326 0.536524 NaN NaN inf
1 1.0 4796 0.255421 0.017489 1225.0 3571.0 0.020833 0.016576 0.813946 0.093077 0.277422 inf
2 2.0 12324 0.240750 0.044940 2967.0 9357.0 0.050457 0.043434 0.770901 0.014671 0.043044 inf
3 3.0 19557 0.227080 0.071315 4441.0 15116.0 0.075525 0.070166 0.730622 0.013670 0.040280 inf
4 4.0 24433 0.226743 0.089095 5540.0 18893.0 0.094214 0.087698 0.729625 0.000337 0.000996 inf
5 5.0 26231 0.220655 0.095652 5788.0 20443.0 0.098432 0.094893 0.711623 0.006088 0.018003 inf
6 6.0 25902 0.221064 0.094452 5726.0 20176.0 0.097378 0.093654 0.712834 0.000409 0.001211 inf
7 7.0 24415 0.211960 0.089030 5175.0 19240.0 0.088007 0.089309 0.685833 0.009104 0.027001 inf
8 8.0 22272 0.212195 0.081215 4726.0 17546.0 0.080371 0.081446 0.686531 0.000235 0.000698 inf
9 9.0 19202 0.208364 0.070020 4001.0 15201.0 0.068042 0.070561 0.675139 0.003831 0.011392 inf
10 10.0 15902 0.216011 0.057987 3435.0 12467.0 0.058416 0.057870 0.697859 0.007647 0.022720 inf
11 11.0 13332 0.210471 0.048615 2806.0 10526.0 0.047719 0.048860 0.681407 0.005540 0.016451 inf
12 12.0 10881 0.202095 0.039678 2199.0 8682.0 0.037397 0.040300 0.656456 0.008376 0.024951 inf
13 13.0 8740 0.216362 0.031871 1891.0 6849.0 0.032159 0.031792 0.698900 0.014266 0.042444 inf
14 14.0 7033 0.200768 0.025646 1412.0 5621.0 0.024013 0.026092 0.652492 0.015594 0.046408 inf
15 15.0 5433 0.214062 0.019812 1163.0 4270.0 0.019778 0.019821 0.692077 0.013294 0.039585 inf
16 16.0 4327 0.203605 0.015778 881.0 3446.0 0.014982 0.015996 0.660961 0.010457 0.031116 inf
17 17.0 3390 0.202655 0.012362 687.0 2703.0 0.011683 0.012547 0.658126 0.000950 0.002835 inf
18 18.0 2626 0.193450 0.009576 508.0 2118.0 0.008639 0.009831 0.630596 0.009205 0.027529 inf
19 19.0 2023 0.207118 0.007377 419.0 1604.0 0.007126 0.007446 0.671431 0.013668 0.040834 inf
20 20.0 1643 0.210590 0.005991 346.0 1297.0 0.005884 0.006020 0.681762 0.003472 0.010332 inf
21 21.0 1244 0.198553 0.004536 247.0 997.0 0.004201 0.004628 0.645874 0.012037 0.035888 inf
22 22.0 993 0.204431 0.003621 203.0 790.0 0.003452 0.003667 0.663424 0.005878 0.017550 inf
23 23.0 720 0.234722 0.002625 169.0 551.0 0.002874 0.002558 0.753163 0.030291 0.089740 inf
24 24.0 556 0.219424 0.002027 122.0 434.0 0.002075 0.002015 0.707979 0.015298 0.045185 inf
25 25.0 423 0.191489 0.001542 81.0 342.0 0.001378 0.001588 0.624716 0.027935 0.083263 inf
26 26.0 351 0.233618 0.001280 82.0 269.0 0.001395 0.001249 0.749911 0.042129 0.125195 inf
27 27.0 266 0.191729 0.000970 51.0 215.0 0.000867 0.000998 0.625436 0.041889 0.124475 inf
28 28.0 227 0.167401 0.000828 38.0 189.0 0.000646 0.000877 0.551937 0.024328 0.073499 inf
29 29.0 168 0.190476 0.000613 32.0 136.0 0.000544 0.000631 0.621675 0.023075 0.069737 inf
30 30.0 124 0.209677 0.000452 26.0 98.0 0.000442 0.000455 0.679047 0.019201 0.057373 inf
31 31.0 93 0.258065 0.000339 24.0 69.0 0.000408 0.000320 0.821683 0.048387 0.142636 inf
32 32.0 82 0.207317 0.000299 17.0 65.0 0.000289 0.000302 0.672023 0.050747 0.149661 inf
33 33.0 78 0.153846 0.000284 12.0 66.0 0.000204 0.000306 0.510500 0.053471 0.161523 inf
34 34.0 37 0.216216 0.000135 8.0 29.0 0.000136 0.000135 0.698469 0.062370 0.187969 inf
35 35.0 35 0.314286 0.000128 11.0 24.0 0.000187 0.000111 0.985514 0.098069 0.287045 inf
36 36.0 29 0.241379 0.000106 7.0 22.0 0.000119 0.000102 0.772752 0.072906 0.212762 inf
37 37.0 18 0.111111 0.000066 2.0 16.0 0.000034 0.000074 0.377039 0.130268 0.395713 inf
38 38.0 23 0.304348 0.000084 7.0 16.0 0.000119 0.000074 0.956612 0.193237 0.579573 inf
39 39.0 17 0.235294 0.000062 4.0 13.0 0.000068 0.000060 0.754848 0.069054 0.201764 inf
40 40.0 8 0.250000 0.000029 2.0 6.0 0.000034 0.000028 0.798060 0.014706 0.043213 inf
41 41.0 10 0.000000 0.000036 0.0 10.0 0.000000 0.000046 0.000000 0.250000 0.798060 inf
42 42.0 6 0.166667 0.000022 1.0 5.0 0.000017 0.000023 0.549702 0.166667 0.549702 inf
43 43.0 5 0.200000 0.000018 1.0 4.0 0.000017 0.000019 0.650199 0.033333 0.100496 inf
44 44.0 8 0.250000 0.000029 2.0 6.0 0.000034 0.000028 0.798060 0.050000 0.147862 inf
45 45.0 4 0.250000 0.000015 1.0 3.0 0.000017 0.000014 0.798060 0.000000 0.000000 inf
46 46.0 2 0.000000 0.000007 0.0 2.0 0.000000 0.000009 0.000000 0.250000 0.798060 inf
47 47.0 3 0.666667 0.000011 2.0 1.0 0.000034 0.000005 2.119548 0.666667 2.119548 inf
48 48.0 1 0.000000 0.000004 0.0 1.0 0.000000 0.000005 0.000000 0.666667 2.119548 inf
49 49.0 2 1.000000 0.000007 2.0 0.0 0.000034 0.000000 inf 1.000000 inf inf
50 51.0 2 0.500000 0.000007 1.0 1.0 0.000017 0.000005 1.539806 0.500000 inf inf
51 53.0 3 0.000000 0.000011 0.0 3.0 0.000000 0.000014 0.000000 0.500000 1.539806 inf
52 54.0 1 0.000000 0.000004 0.0 1.0 0.000000 0.000005 0.000000 0.000000 0.000000 inf
53 56.0 1 1.000000 0.000004 1.0 0.0 0.000017 0.000000 inf 1.000000 inf inf
54 60.0 1 0.000000 0.000004 0.0 1.0 0.000000 0.000005 0.000000 1.000000 inf inf
55 61.0 1 0.000000 0.000004 0.0 1.0 0.000000 0.000005 0.000000 0.000000 0.000000 inf
56 66.0 1 0.000000 0.000004 0.0 1.0 0.000000 0.000005 0.000000 0.000000 0.000000 inf
In [1366]:
plot_by_woe(df_temp, 90)
# We plot the weight of evidence values.
[Figure: weight of evidence by 'num_bc_tl' category]
In [1367]:
# We create the following categories: '0', '1-5', '6-10', '11-20', '21-32', '>=33' 
# '>=33' will be the reference category
df_inputs_prepr['num_bc_tl:0'] = np.where(df_inputs_prepr['num_bc_tl'].isin([0]), 1, 0)
df_inputs_prepr['num_bc_tl:1-5'] = np.where(df_inputs_prepr['num_bc_tl'].isin(range(1, 6)), 1, 0)
df_inputs_prepr['num_bc_tl:6-10'] = np.where(df_inputs_prepr['num_bc_tl'].isin(range(6, 11)), 1, 0)
df_inputs_prepr['num_bc_tl:11-20'] = np.where(df_inputs_prepr['num_bc_tl'].isin(range(11, 21)), 1, 0)
df_inputs_prepr['num_bc_tl:21-32'] = np.where(df_inputs_prepr['num_bc_tl'].isin(range(21, 33)), 1, 0)
df_inputs_prepr['num_bc_tl:>=33'] = np.where(df_inputs_prepr['num_bc_tl'].isin(range(33, 500)), 1, 0)

Variable: 'open_acc_6m'¶

In [1368]:
# unique values
df_inputs_prepr['open_acc_6m'].unique() 
Out[1368]:
array([ 0.,  3.,  1.,  2.,  4., 12.,  5.,  6.,  9.,  7.,  8., 11., 14.,
       10., 15.])
In [1369]:
# 'open_acc_6m'
df_temp = woe_ordered_continuous(df_inputs_prepr, 'open_acc_6m', df_targets_prepr)

# We calculate weight of evidence.
df_temp
Out[1369]:
open_acc_6m n_obs prop_good prop_n_obs n_good n_bad prop_n_good prop_n_bad WoE diff_prop_good diff_WoE IV
0 0.0 207285 0.196367 0.755869 40704.0 166581.0 0.692221 0.773242 0.639335 NaN NaN 0.019426
1 1.0 35777 0.255443 0.130462 9139.0 26638.0 0.155420 0.123649 0.814011 0.059076 0.174676 0.019426
2 2.0 18505 0.275061 0.067479 5090.0 13415.0 0.086562 0.062270 0.871334 0.019617 0.057323 0.019426
3 3.0 7791 0.296496 0.028410 2310.0 5481.0 0.039284 0.025442 0.933770 0.021435 0.062436 0.019426
4 4.0 3070 0.308469 0.011195 947.0 2123.0 0.016105 0.009855 0.968598 0.011973 0.034828 0.019426
5 5.0 1058 0.341210 0.003858 361.0 697.0 0.006139 0.003235 1.063865 0.032741 0.095267 0.019426
6 6.0 467 0.336188 0.001703 157.0 310.0 0.002670 0.001439 1.049240 0.005021 0.014625 0.019426
7 7.0 142 0.359155 0.000518 51.0 91.0 0.000867 0.000422 1.116214 0.022966 0.066975 0.019426
8 8.0 65 0.230769 0.000237 15.0 50.0 0.000255 0.000232 0.741511 0.128386 0.374703 0.019426
9 9.0 44 0.386364 0.000160 17.0 27.0 0.000289 0.000125 1.195970 0.155594 0.454459 0.019426
10 10.0 11 0.363636 0.000040 4.0 7.0 0.000068 0.000032 1.129314 0.022727 0.066656 0.019426
11 11.0 9 0.333333 0.000033 3.0 6.0 0.000051 0.000028 1.040928 0.030303 0.088387 0.019426
12 12.0 5 0.400000 0.000018 2.0 3.0 0.000034 0.000014 1.236185 0.066667 0.195258 0.019426
13 14.0 4 0.500000 0.000015 2.0 2.0 0.000034 0.000009 1.539806 0.100000 0.303621 0.019426
14 15.0 1 0.000000 0.000004 0.0 1.0 0.000000 0.000005 0.000000 0.500000 1.539806 0.019426
In [1370]:
plot_by_woe(df_temp, 90)
# We plot the weight of evidence values.
[Figure: weight of evidence by 'open_acc_6m' category]
In [1371]:
# We create the following categories: '0', '1-3', '4-7', '>=8' 
# '>=8' will be the reference category
df_inputs_prepr['open_acc_6m:0'] = np.where(df_inputs_prepr['open_acc_6m'].isin([0]), 1, 0)
df_inputs_prepr['open_acc_6m:1-3'] = np.where(df_inputs_prepr['open_acc_6m'].isin(range(1, 4)), 1, 0)
df_inputs_prepr['open_acc_6m:4-7'] = np.where(df_inputs_prepr['open_acc_6m'].isin(range(4, 8)), 1, 0)
df_inputs_prepr['open_acc_6m:>=8'] = np.where(df_inputs_prepr['open_acc_6m'].isin(range(8, 500)), 1, 0)
C:\Users\pc\AppData\Local\Temp\ipykernel_6728\1521767103.py:6: PerformanceWarning: DataFrame is highly fragmented.  This is usually the result of calling `frame.insert` many times, which has poor performance.  Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
  df_inputs_prepr['open_acc_6m:>=8'] = np.where(df_inputs_prepr['open_acc_6m'].isin(range(8, 500)), 1, 0)
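The PerformanceWarning appears because each `np.where` assignment inserts one column at a time into an already-wide DataFrame, fragmenting its memory layout. As the warning suggests, building all the dummies for a variable first and attaching them with a single `pd.concat` avoids this. A minimal sketch, with an illustrative helper name and `categories` structure that are not part of the original notebook:

```python
import numpy as np
import pandas as pd

def concat_dummies(df, col, categories):
    """Build all dummy columns for one variable in a dict, then attach
    them with a single pd.concat to avoid DataFrame fragmentation."""
    new_cols = {
        f'{col}:{label}': np.where(df[col].isin(values), 1, 0)
        for label, values in categories.items()
    }
    # One concat instead of many single-column inserts.
    return pd.concat([df, pd.DataFrame(new_cols, index=df.index)], axis=1)
```

For instance, `concat_dummies(df_inputs_prepr, 'open_acc_6m', {'0': [0], '1-3': range(1, 4), '4-7': range(4, 8), '>=8': range(8, 500)})` would mirror the four assignments above without triggering the warning.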

Variable: 'acc_open_past_24mths'¶

In [1372]:
# unique values
df_inputs_prepr['acc_open_past_24mths'].unique() 
Out[1372]:
array([ 4.,  9.,  8.,  2., 10.,  3.,  5., 13.,  6.,  7.,  0.,  1., 12.,
       17., 11., 16., 14., 15., 18., 19., 20., 27., 29., 21., 22., 26.,
       25., 24., 23., 31., 33., 34., 30., 32., 28., 39., 40., 36., 41.,
       35., 38., 42., 46.])
In [1373]:
# 'acc_open_past_24mths'
df_temp = woe_ordered_continuous(df_inputs_prepr, 'acc_open_past_24mths', df_targets_prepr)

# We calculate weight of evidence.
df_temp
Out[1373]:
acc_open_past_24mths n_obs prop_good prop_n_obs n_good n_bad prop_n_good prop_n_bad WoE diff_prop_good diff_WoE IV
0 0.0 19825 0.153090 0.072292 3035.0 16790.0 0.051614 0.077936 0.508176 NaN NaN inf
1 1.0 24052 0.160111 0.087706 3851.0 20201.0 0.065491 0.093770 0.529700 0.007022 0.021524 inf
2 2.0 34456 0.175238 0.125645 6038.0 28418.0 0.102684 0.131912 0.575729 0.015127 0.046029 inf
3 3.0 38994 0.193773 0.142192 7556.0 31438.0 0.128499 0.145930 0.631566 0.018535 0.055836 inf
4 4.0 37269 0.210550 0.135902 7847.0 29422.0 0.133448 0.136572 0.681643 0.016777 0.050078 inf
5 5.0 31871 0.221989 0.116218 7075.0 24796.0 0.120319 0.115099 0.715570 0.011438 0.033927 inf
6 6.0 25390 0.240843 0.092585 6115.0 19275.0 0.103993 0.089471 0.771175 0.018854 0.055605 inf
7 7.0 19210 0.250182 0.070050 4806.0 14404.0 0.081732 0.066861 0.798595 0.009339 0.027420 inf
8 8.0 13730 0.274144 0.050067 3764.0 9966.0 0.064011 0.046261 0.868660 0.023962 0.070066 inf
9 9.0 9350 0.274545 0.034095 2567.0 6783.0 0.043655 0.031486 0.869831 0.000401 0.001170 inf
10 10.0 6548 0.295968 0.023877 1938.0 4610.0 0.032958 0.021399 0.932234 0.021423 0.062403 inf
11 11.0 4405 0.300795 0.016063 1325.0 3080.0 0.022533 0.014297 0.946276 0.004826 0.014042 inf
12 12.0 2919 0.298047 0.010644 870.0 2049.0 0.014795 0.009511 0.938283 0.002747 0.007992 inf
13 13.0 1897 0.317343 0.006917 602.0 1295.0 0.010238 0.006011 0.994406 0.019296 0.056123 inf
14 14.0 1344 0.308036 0.004901 414.0 930.0 0.007041 0.004317 0.967338 0.009307 0.027068 inf
15 15.0 894 0.329978 0.003260 295.0 599.0 0.005017 0.002780 1.031161 0.021942 0.063823 inf
16 16.0 591 0.340102 0.002155 201.0 390.0 0.003418 0.001810 1.060636 0.010124 0.029475 inf
17 17.0 399 0.293233 0.001455 117.0 282.0 0.001990 0.001309 0.924275 0.046868 0.136361 inf
18 18.0 318 0.371069 0.001160 118.0 200.0 0.002007 0.000928 1.151070 0.077836 0.226795 inf
19 19.0 227 0.356828 0.000828 81.0 146.0 0.001378 0.000678 1.109418 0.014241 0.041652 inf
20 20.0 144 0.381944 0.000525 55.0 89.0 0.000935 0.000413 1.182976 0.025116 0.073559 inf
21 21.0 110 0.309091 0.000401 34.0 76.0 0.000578 0.000353 0.970406 0.072854 0.212570 inf
22 22.0 86 0.325581 0.000314 28.0 58.0 0.000476 0.000269 1.018369 0.016490 0.047963 inf
23 23.0 54 0.259259 0.000197 14.0 40.0 0.000238 0.000186 0.825179 0.066322 0.193190 inf
24 24.0 37 0.243243 0.000135 9.0 28.0 0.000153 0.000130 0.778229 0.016016 0.046950 inf
25 25.0 28 0.500000 0.000102 14.0 14.0 0.000238 0.000065 1.539806 0.256757 0.761577 inf
26 26.0 21 0.476190 0.000077 10.0 11.0 0.000170 0.000051 1.465711 0.023810 0.074095 inf
27 27.0 10 0.300000 0.000036 3.0 7.0 0.000051 0.000032 0.943965 0.176190 0.521747 inf
28 28.0 9 0.222222 0.000033 2.0 7.0 0.000034 0.000032 0.716262 0.077778 0.227703 inf
29 29.0 11 0.636364 0.000040 7.0 4.0 0.000119 0.000019 2.003026 0.414141 1.286764 inf
30 30.0 7 0.142857 0.000026 1.0 6.0 0.000017 0.000028 0.476616 0.493506 1.526410 inf
31 31.0 8 0.500000 0.000029 4.0 4.0 0.000068 0.000019 1.539806 0.357143 1.063190 inf
32 32.0 6 0.166667 0.000022 1.0 5.0 0.000017 0.000023 0.549702 0.333333 0.990104 inf
33 33.0 3 0.000000 0.000011 0.0 3.0 0.000000 0.000014 0.000000 0.166667 0.549702 inf
34 34.0 1 1.000000 0.000004 1.0 0.0 0.000017 0.000000 inf 1.000000 inf inf
35 35.0 3 0.333333 0.000011 1.0 2.0 0.000017 0.000009 1.040928 0.666667 inf inf
36 36.0 1 0.000000 0.000004 0.0 1.0 0.000000 0.000005 0.000000 0.333333 1.040928 inf
37 38.0 1 0.000000 0.000004 0.0 1.0 0.000000 0.000005 0.000000 0.000000 0.000000 inf
38 39.0 1 1.000000 0.000004 1.0 0.0 0.000017 0.000000 inf 1.000000 inf inf
39 40.0 1 0.000000 0.000004 0.0 1.0 0.000000 0.000005 0.000000 1.000000 inf inf
40 41.0 1 1.000000 0.000004 1.0 0.0 0.000017 0.000000 inf 1.000000 inf inf
41 42.0 1 0.000000 0.000004 0.0 1.0 0.000000 0.000005 0.000000 1.000000 inf inf
42 46.0 1 1.000000 0.000004 1.0 0.0 0.000017 0.000000 inf 1.000000 inf inf
In [1374]:
plot_by_woe(df_temp, 90)
# We plot the weight of evidence values.
[Figure: weight of evidence by 'acc_open_past_24mths' category]
In [1375]:
# We create the following categories: '0-3', '4-7', '8-13', '14-21', '>=22' 
# '>=22' will be the reference category
df_inputs_prepr['acc_open_past_24mths:0-3'] = np.where(df_inputs_prepr['acc_open_past_24mths'].isin(range(0, 4)), 1, 0)
df_inputs_prepr['acc_open_past_24mths:4-7'] = np.where(df_inputs_prepr['acc_open_past_24mths'].isin(range(4, 8)), 1, 0)
df_inputs_prepr['acc_open_past_24mths:8-13'] = np.where(df_inputs_prepr['acc_open_past_24mths'].isin(range(8, 14)), 1, 0)
df_inputs_prepr['acc_open_past_24mths:14-21'] = np.where(df_inputs_prepr['acc_open_past_24mths'].isin(range(14, 22)), 1, 0)
df_inputs_prepr['acc_open_past_24mths:>=22'] = np.where(df_inputs_prepr['acc_open_past_24mths'].isin(range(22, 500)), 1, 0)

Variable: 'total_cu_tl'¶

In [1376]:
# unique values
df_inputs_prepr['total_cu_tl'].unique() 
Out[1376]:
array([ 0.,  2.,  1.,  8.,  4.,  3.,  5.,  6., 10., 11., 22.,  9.,  7.,
       12., 17., 13., 15., 24., 14., 19., 16., 31., 23., 20., 21., 18.,
       27., 28., 33., 26., 38., 34., 25., 29., 48., 37., 32., 43., 30.,
       40., 41.])
In [1377]:
# 'total_cu_tl'
df_temp = woe_ordered_continuous(df_inputs_prepr, 'total_cu_tl', df_targets_prepr)

# We calculate weight of evidence.
df_temp
Out[1377]:
total_cu_tl n_obs prop_good prop_n_obs n_good n_bad prop_n_good prop_n_bad WoE diff_prop_good diff_WoE IV
0 0.0 220366 0.206797 0.803569 45571.0 174795.0 0.774991 0.811370 0.670474 NaN NaN inf
1 1.0 18961 0.251516 0.069142 4769.0 14192.0 0.081103 0.065877 0.802506 0.044719 0.132032 inf
2 2.0 10641 0.247251 0.038803 2631.0 8010.0 0.044743 0.037181 0.789997 0.004265 0.012508 inf
3 3.0 6975 0.245591 0.025434 1713.0 5262.0 0.029132 0.024425 0.785125 0.001660 0.004872 inf
4 4.0 4696 0.244676 0.017124 1149.0 3547.0 0.019540 0.016465 0.782439 0.000915 0.002687 inf
5 5.0 3361 0.238322 0.012256 801.0 2560.0 0.013622 0.011883 0.763761 0.006354 0.018678 inf
6 6.0 2405 0.241996 0.008770 582.0 1823.0 0.009898 0.008462 0.774564 0.003674 0.010803 inf
7 7.0 1715 0.233819 0.006254 401.0 1314.0 0.006819 0.006099 0.750503 0.008177 0.024061 inf
8 8.0 1313 0.221630 0.004788 291.0 1022.0 0.004949 0.004744 0.714509 0.012189 0.035994 inf
9 9.0 941 0.224230 0.003431 211.0 730.0 0.003588 0.003389 0.722199 0.002600 0.007690 inf
10 10.0 683 0.248902 0.002491 170.0 513.0 0.002891 0.002381 0.794840 0.024672 0.072641 inf
11 11.0 520 0.240385 0.001896 125.0 395.0 0.002126 0.001834 0.769828 0.008517 0.025012 inf
12 12.0 392 0.252551 0.001429 99.0 293.0 0.001684 0.001360 0.805538 0.012166 0.035710 inf
13 13.0 284 0.193662 0.001036 55.0 229.0 0.000935 0.001063 0.631232 0.058889 0.174307 inf
14 14.0 225 0.213333 0.000820 48.0 177.0 0.000816 0.000822 0.689913 0.019671 0.058681 inf
15 15.0 164 0.195122 0.000598 32.0 132.0 0.000544 0.000613 0.635606 0.018211 0.054307 inf
16 16.0 148 0.283784 0.000540 42.0 106.0 0.000714 0.000492 0.896761 0.088662 0.261155 inf
17 17.0 81 0.271605 0.000295 22.0 59.0 0.000374 0.000274 0.861251 0.012179 0.035509 inf
18 18.0 73 0.164384 0.000266 12.0 61.0 0.000204 0.000283 0.542746 0.107221 0.318506 inf
19 19.0 56 0.303571 0.000204 17.0 39.0 0.000289 0.000181 0.954353 0.139188 0.411608 inf
20 20.0 53 0.188679 0.000193 10.0 43.0 0.000170 0.000200 0.616277 0.114892 0.338077 inf
21 21.0 40 0.225000 0.000146 9.0 31.0 0.000153 0.000144 0.724476 0.036321 0.108200 inf
22 22.0 19 0.315789 0.000069 6.0 13.0 0.000102 0.000060 0.989887 0.090789 0.265411 inf
23 23.0 26 0.192308 0.000095 5.0 21.0 0.000085 0.000097 0.627171 0.123482 0.362717 inf
24 24.0 21 0.619048 0.000077 13.0 8.0 0.000221 0.000037 1.939243 0.426740 1.312073 inf
25 25.0 16 0.187500 0.000058 3.0 13.0 0.000051 0.000060 0.612732 0.431548 1.326512 inf
26 26.0 11 0.363636 0.000040 4.0 7.0 0.000068 0.000032 1.129314 0.176136 0.516583 inf
27 27.0 9 0.222222 0.000033 2.0 7.0 0.000034 0.000032 0.716262 0.141414 0.413053 inf
28 28.0 8 0.125000 0.000029 1.0 7.0 0.000017 0.000032 0.420934 0.097222 0.295328 inf
29 29.0 4 0.250000 0.000015 1.0 3.0 0.000017 0.000014 0.798060 0.125000 0.377126 inf
30 30.0 2 0.000000 0.000007 0.0 2.0 0.000000 0.000009 0.000000 0.250000 0.798060 inf
31 31.0 6 0.333333 0.000022 2.0 4.0 0.000034 0.000019 1.040928 0.333333 1.040928 inf
32 32.0 5 0.200000 0.000018 1.0 4.0 0.000017 0.000019 0.650199 0.133333 0.390729 inf
33 33.0 4 0.250000 0.000015 1.0 3.0 0.000017 0.000014 0.798060 0.050000 0.147862 inf
34 34.0 2 0.000000 0.000007 0.0 2.0 0.000000 0.000009 0.000000 0.250000 0.798060 inf
35 37.0 1 0.000000 0.000004 0.0 1.0 0.000000 0.000005 0.000000 0.000000 0.000000 inf
36 38.0 2 0.500000 0.000007 1.0 1.0 0.000017 0.000005 1.539806 0.500000 1.539806 inf
37 40.0 1 0.000000 0.000004 0.0 1.0 0.000000 0.000005 0.000000 0.500000 1.539806 inf
38 41.0 1 1.000000 0.000004 1.0 0.0 0.000017 0.000000 inf 1.000000 inf inf
39 43.0 1 0.000000 0.000004 0.0 1.0 0.000000 0.000005 0.000000 1.000000 inf inf
40 48.0 2 0.500000 0.000007 1.0 1.0 0.000017 0.000005 1.539806 0.500000 1.539806 inf
In [1378]:
plot_by_woe(df_temp, 90)
# We plot the weight of evidence values.
[Figure: weight of evidence by 'total_cu_tl' category]
In [1379]:
# We create the following categories: '0', '1-7', '8-17', '>=18' 
# '>=18' will be the reference category
df_inputs_prepr['total_cu_tl:0'] = np.where(df_inputs_prepr['total_cu_tl'].isin([0]), 1, 0)
df_inputs_prepr['total_cu_tl:1-7'] = np.where(df_inputs_prepr['total_cu_tl'].isin(range(1, 8)), 1, 0)
df_inputs_prepr['total_cu_tl:8-17'] = np.where(df_inputs_prepr['total_cu_tl'].isin(range(8, 18)), 1, 0)
df_inputs_prepr['total_cu_tl:>=18'] = np.where(df_inputs_prepr['total_cu_tl'].isin(range(18, 500)), 1, 0)
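Since every variable's categories (reference category included) are meant to partition the observed values, a quick sanity check is that the dummy columns for each variable sum to exactly 1 per row. The helper below is a hypothetical check, not part of the original notebook; note it would flag raw values of 500 or more, which the `range(..., 500)` top categories do not cover:

```python
import pandas as pd

def check_partition(df, prefix):
    """Verify that the dummy columns named '<prefix>:<label>' assign every
    row to exactly one category (reference category included)."""
    dummies = df.filter(regex=f'^{prefix}:')
    return bool((dummies.sum(axis=1) == 1).all())
```

Running it as `check_partition(df_inputs_prepr, 'total_cu_tl')` after the assignments above would confirm no observation is dropped or double-counted by the binning.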

Variable: 'inq_last_12m'¶

In [1380]:
# unique values
df_inputs_prepr['inq_last_12m'].unique() 
Out[1380]:
array([ 0.,  4.,  2.,  3.,  1.,  5.,  7., 12.,  6., 14., 11., 18.,  8.,
       20., 10., 15.,  9., 16., 28., 13., 21., 17., 22., 19., 32., 26.,
       23., 25., 29., 24., 33., 34., 31., 30., 27., 40.])
In [1381]:
# 'inq_last_12m'
df_temp = woe_ordered_continuous(df_inputs_prepr, 'inq_last_12m', df_targets_prepr)

# We calculate weight of evidence.
df_temp
Out[1381]:
inq_last_12m n_obs prop_good prop_n_obs n_good n_bad prop_n_good prop_n_bad WoE diff_prop_good diff_WoE IV
0 0.0 188518 0.190889 0.687435 35986.0 152532.0 0.611986 0.708029 0.622914 NaN NaN inf
1 1.0 26092 0.245746 0.095145 6412.0 19680.0 0.109044 0.091351 0.785579 0.054857 0.162665 inf
2 2.0 19977 0.256145 0.072847 5117.0 14860.0 0.087021 0.068978 0.816064 0.010399 0.030485 inf
3 3.0 13606 0.273629 0.049615 3723.0 9883.0 0.063314 0.045875 0.867158 0.017485 0.051095 inf
4 4.0 9261 0.276212 0.033770 2558.0 6703.0 0.043502 0.031114 0.874692 0.002583 0.007534 inf
5 5.0 5760 0.284028 0.021004 1636.0 4124.0 0.027822 0.019143 0.897472 0.007816 0.022780 inf
6 6.0 3730 0.285255 0.013602 1064.0 2666.0 0.018095 0.012375 0.901045 0.001227 0.003574 inf
7 7.0 2383 0.300042 0.008690 715.0 1668.0 0.012159 0.007743 0.944087 0.014787 0.043041 inf
8 8.0 1593 0.321406 0.005809 512.0 1081.0 0.008707 0.005018 1.006223 0.021364 0.062137 inf
9 9.0 1024 0.303711 0.003734 311.0 713.0 0.005289 0.003310 0.954759 0.017695 0.051464 inf
10 10.0 653 0.320061 0.002381 209.0 444.0 0.003554 0.002061 1.002311 0.016350 0.047552 inf
11 11.0 495 0.339394 0.001805 168.0 327.0 0.002857 0.001518 1.058575 0.019333 0.056263 inf
12 12.0 293 0.300341 0.001068 88.0 205.0 0.001497 0.000952 0.944957 0.039053 0.113617 inf
13 13.0 224 0.330357 0.000817 74.0 150.0 0.001258 0.000696 1.032265 0.030016 0.087308 inf
14 14.0 186 0.413978 0.000678 77.0 109.0 0.001309 0.000506 1.277625 0.083621 0.245360 inf
15 15.0 104 0.384615 0.000379 40.0 64.0 0.000680 0.000297 1.190828 0.029363 0.086797 inf
16 16.0 98 0.367347 0.000357 36.0 62.0 0.000612 0.000288 1.140170 0.017268 0.050657 inf
17 17.0 58 0.293103 0.000211 17.0 41.0 0.000289 0.000190 0.923897 0.074243 0.216273 inf
18 18.0 42 0.309524 0.000153 13.0 29.0 0.000221 0.000135 0.971665 0.016420 0.047768 inf
19 19.0 40 0.250000 0.000146 10.0 30.0 0.000170 0.000139 0.798060 0.059524 0.173605 inf
20 20.0 21 0.285714 0.000077 6.0 15.0 0.000102 0.000070 0.902384 0.035714 0.104324 inf
21 21.0 16 0.562500 0.000058 9.0 7.0 0.000153 0.000032 1.742298 0.276786 0.839914 inf
22 22.0 14 0.500000 0.000051 7.0 7.0 0.000119 0.000032 1.539806 0.062500 0.202492 inf
23 23.0 13 0.307692 0.000047 4.0 9.0 0.000068 0.000042 0.966339 0.192308 0.573467 inf
24 24.0 7 0.428571 0.000026 3.0 4.0 0.000051 0.000019 1.321159 0.120879 0.354820 inf
25 25.0 7 0.285714 0.000026 2.0 5.0 0.000034 0.000023 0.902384 0.142857 0.418775 inf
26 26.0 5 0.200000 0.000018 1.0 4.0 0.000017 0.000019 0.650199 0.085714 0.252186 inf
27 27.0 1 0.000000 0.000004 0.0 1.0 0.000000 0.000005 0.000000 0.200000 0.650199 inf
28 28.0 1 1.000000 0.000004 1.0 0.0 0.000017 0.000000 inf 1.000000 inf inf
29 29.0 4 0.250000 0.000015 1.0 3.0 0.000017 0.000014 0.798060 0.750000 inf inf
30 30.0 2 0.000000 0.000007 0.0 2.0 0.000000 0.000009 0.000000 0.250000 0.798060 inf
31 31.0 1 1.000000 0.000004 1.0 0.0 0.000017 0.000000 inf 1.000000 inf inf
32 32.0 1 0.000000 0.000004 0.0 1.0 0.000000 0.000005 0.000000 1.000000 inf inf
33 33.0 1 0.000000 0.000004 0.0 1.0 0.000000 0.000005 0.000000 0.000000 0.000000 inf
34 34.0 2 0.000000 0.000007 0.0 2.0 0.000000 0.000009 0.000000 0.000000 0.000000 inf
35 40.0 1 1.000000 0.000004 1.0 0.0 0.000017 0.000000 inf 1.000000 inf inf
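For reference, `woe_ordered_continuous` is defined earlier in the notebook; a textbook sketch of the same kind of computation (simplified, and not guaranteed to match the helper's exact output) is shown below. Note that the infinite WoE/IV entries in the table above come from sparse bins in which either no good or no bad observations occur, so the log-odds ratio diverges.

```python
import numpy as np
import pandas as pd

def woe_table(series, target):
    # Textbook weight of evidence for an ordered discrete variable:
    #   WoE_i = ln(share of goods in bin i / share of bads in bin i)
    #   IV    = sum_i (share_good_i - share_bad_i) * WoE_i
    df = pd.DataFrame({'x': series.values, 'y': target.values})
    g = df.groupby('x')['y'].agg(n_obs='count', n_good='sum')
    g['n_bad'] = g['n_obs'] - g['n_good']
    g['prop_n_good'] = g['n_good'] / g['n_good'].sum()
    g['prop_n_bad'] = g['n_bad'] / g['n_bad'].sum()
    g['WoE'] = np.log(g['prop_n_good'] / g['prop_n_bad'])
    g['IV'] = ((g['prop_n_good'] - g['prop_n_bad']) * g['WoE']).sum()
    return g.reset_index()
```

Bins with zero goods give WoE = -inf and bins with zero bads give +inf, which is exactly why such tail categories are merged into broader classes in the coarse-classing step that follows.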
In [1382]:
plot_by_woe(df_temp, 90)
# We plot the weight of evidence values.
[Figure: weight of evidence by 'inq_last_12m' value, x-axis labels rotated 90°]
In [1383]:
# We create the following categories: '0', '1-4', '5-9', '10-16', '>=17' 
# '>=17' will be the reference category
df_inputs_prepr['inq_last_12m:0'] = np.where(df_inputs_prepr['inq_last_12m'].isin([0]), 1, 0)
df_inputs_prepr['inq_last_12m:1-4'] = np.where(df_inputs_prepr['inq_last_12m'].isin(range(1, 5)), 1, 0)
df_inputs_prepr['inq_last_12m:5-9'] = np.where(df_inputs_prepr['inq_last_12m'].isin(range(5, 10)), 1, 0)
df_inputs_prepr['inq_last_12m:10-16'] = np.where(df_inputs_prepr['inq_last_12m'].isin(range(10, 17)), 1, 0)
df_inputs_prepr['inq_last_12m:>=17'] = np.where(df_inputs_prepr['inq_last_12m'].isin(range(17, 500)), 1, 0)
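Inserting dummy columns one at a time fragments the DataFrame and triggers a pandas PerformanceWarning; the same coarse classes can be built in a single `pd.concat`. A minimal standalone sketch on hypothetical toy values (the real notebook would concat onto `df_inputs_prepr`):

```python
import pandas as pd

# Toy values for 'inq_last_12m' (hypothetical, for illustration only).
s = pd.Series([0, 2, 6, 12, 25], name='inq_last_12m')

# Build all coarse-class dummies at once; range(17, 500) mirrors the
# notebook's upper cap for the '>=17' category.
dummies = pd.DataFrame({
    'inq_last_12m:0': s.isin([0]),
    'inq_last_12m:1-4': s.isin(range(1, 5)),
    'inq_last_12m:5-9': s.isin(range(5, 10)),
    'inq_last_12m:10-16': s.isin(range(10, 17)),
    'inq_last_12m:>=17': s.isin(range(17, 500)),
}).astype(int)

# One concat instead of five column inserts keeps the frame de-fragmented:
# df_inputs_prepr = pd.concat([df_inputs_prepr, dummies], axis=1)
```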

Variable: 'mths_since_recent_inq'¶

In [1384]:
# unique values
df_inputs_prepr['mths_since_recent_inq'].unique() 
Out[1384]:
array([  5.,   3.,  21.,   0.,   1.,  19.,   2.,  11.,   4., 999.,  12.,
         7.,   9.,  10.,  15.,  14.,  20.,   8.,   6.,  18.,  22.,  13.,
        16.,  23.,  17.,  24.,  25.])
In [1385]:
# 'mths_since_recent_inq'
df_temp = woe_ordered_continuous(df_inputs_prepr, 'mths_since_recent_inq', df_targets_prepr)

# We calculate weight of evidence.
df_temp
Out[1385]:
mths_since_recent_inq n_obs prop_good prop_n_obs n_good n_bad prop_n_good prop_n_bad WoE diff_prop_good diff_WoE IV
0 0.0 22048 0.267371 0.080398 5895.0 16153.0 0.100252 0.074980 0.848891 NaN NaN 0.017868
1 1.0 27914 0.258472 0.101789 7215.0 20699.0 0.122700 0.096081 0.822877 0.008899 0.026014 0.017868
2 2.0 22108 0.240773 0.080617 5323.0 16785.0 0.090524 0.077913 0.770968 0.017700 0.051909 0.017868
3 3.0 19797 0.228671 0.072190 4527.0 15270.0 0.076987 0.070881 0.735320 0.012102 0.035648 0.017868
4 4.0 17738 0.221896 0.064682 3936.0 13802.0 0.066936 0.064067 0.715298 0.006775 0.020022 0.017868
5 5.0 15701 0.209031 0.057254 3282.0 12419.0 0.055814 0.057647 0.677125 0.012865 0.038173 0.017868
6 6.0 13973 0.212911 0.050953 2975.0 10998.0 0.050594 0.051051 0.688657 0.003879 0.011532 0.017868
7 7.0 13218 0.216069 0.048200 2856.0 10362.0 0.048570 0.048099 0.698032 0.003158 0.009375 0.017868
8 8.0 11630 0.216853 0.042409 2522.0 9108.0 0.042890 0.042278 0.700357 0.000784 0.002325 0.017868
9 9.0 10086 0.208903 0.036779 2107.0 7979.0 0.035832 0.037037 0.676745 0.007950 0.023613 0.017868
10 10.0 8841 0.199638 0.032239 1765.0 7076.0 0.030016 0.032846 0.649117 0.009265 0.027628 0.017868
11 11.0 7851 0.193861 0.028629 1522.0 6329.0 0.025883 0.029378 0.631827 0.005777 0.017290 0.017868
12 12.0 6969 0.199598 0.025413 1391.0 5578.0 0.023656 0.025892 0.648998 0.005738 0.017171 0.017868
13 13.0 6299 0.193364 0.022969 1218.0 5081.0 0.020714 0.023585 0.630338 0.006234 0.018660 0.017868
14 14.0 5481 0.201423 0.019987 1104.0 4377.0 0.018775 0.020317 0.654449 0.008059 0.024111 0.017868
15 15.0 4819 0.202739 0.017573 977.0 3842.0 0.016615 0.017834 0.658377 0.001316 0.003928 0.017868
16 16.0 4109 0.208080 0.014984 855.0 3254.0 0.014540 0.015105 0.674294 0.005341 0.015916 0.017868
17 17.0 3661 0.195575 0.013350 716.0 2945.0 0.012176 0.013670 0.636963 0.012505 0.037331 0.017868
18 18.0 3349 0.180950 0.012212 606.0 2743.0 0.010306 0.012733 0.592997 0.014625 0.043966 0.017868
19 19.0 3052 0.186435 0.011129 569.0 2483.0 0.009677 0.011526 0.609528 0.005486 0.016531 0.017868
20 20.0 2628 0.177702 0.009583 467.0 2161.0 0.007942 0.010031 0.583185 0.008733 0.026344 0.017868
21 21.0 2406 0.174979 0.008774 421.0 1985.0 0.007160 0.009214 0.574945 0.002722 0.008239 0.017868
22 22.0 2187 0.172840 0.007975 378.0 1809.0 0.006428 0.008397 0.568460 0.002140 0.006485 0.017868
23 23.0 2086 0.170662 0.007607 356.0 1730.0 0.006054 0.008030 0.561850 0.002178 0.006610 0.017868
24 24.0 975 0.177436 0.003555 173.0 802.0 0.002942 0.003723 0.582381 0.006774 0.020531 0.017868
25 25.0 3 0.000000 0.000011 0.0 3.0 0.000000 0.000014 0.000000 0.177436 0.582381 0.017868
26 999.0 35305 0.159921 0.128740 5646.0 29659.0 0.096017 0.137672 0.529117 0.159921 0.529117 0.017868
In [1386]:
plot_by_woe(df_temp, 90)
# We plot the weight of evidence values.
[Figure: weight of evidence by 'mths_since_recent_inq' value, x-axis labels rotated 90°]
In [1389]:
# We create the following categories: '0-1', '2-3', '4-6', '7-10', '11-15', '>=16', 'Missing' (= 999).
# '>=16' will be the reference category
df_inputs_prepr['mths_since_recent_inq:Missing'] = np.where(df_inputs_prepr['mths_since_recent_inq'].isin([999]), 1, 0)
df_inputs_prepr['mths_since_recent_inq:0-1'] = np.where(df_inputs_prepr['mths_since_recent_inq'].isin(range(0, 2)), 1, 0)
df_inputs_prepr['mths_since_recent_inq:2-3'] = np.where(df_inputs_prepr['mths_since_recent_inq'].isin(range(2, 4)), 1, 0)
df_inputs_prepr['mths_since_recent_inq:4-6'] = np.where(df_inputs_prepr['mths_since_recent_inq'].isin(range(4, 7)), 1, 0)
df_inputs_prepr['mths_since_recent_inq:7-10'] = np.where(df_inputs_prepr['mths_since_recent_inq'].isin(range(7, 11)), 1, 0)
df_inputs_prepr['mths_since_recent_inq:11-15'] = np.where(df_inputs_prepr['mths_since_recent_inq'].isin(range(11, 16)), 1, 0)
df_inputs_prepr['mths_since_recent_inq:>=16'] = np.where(df_inputs_prepr['mths_since_recent_inq'].isin(range(16, 500)), 1, 0)
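Because the dummy sets above are meant to be mutually exclusive and exhaustive, a quick standalone sanity check (on hypothetical toy values, using the same bin definitions, including the 999 'Missing' sentinel) is that every observation activates exactly one category:

```python
import pandas as pd

# Toy values for 'mths_since_recent_inq' (hypothetical, for illustration).
s = pd.Series([0, 3, 5, 8, 14, 20, 999], name='mths_since_recent_inq')
cats = pd.DataFrame({
    'Missing': s.isin([999]).astype(int),
    '0-1': s.isin(range(0, 2)).astype(int),
    '2-3': s.isin(range(2, 4)).astype(int),
    '4-6': s.isin(range(4, 7)).astype(int),
    '7-10': s.isin(range(7, 11)).astype(int),
    '11-15': s.isin(range(11, 16)).astype(int),
    '>=16': s.isin(range(16, 500)).astype(int),
})
# Each observation should fall in exactly one bucket.
assert (cats.sum(axis=1) == 1).all()
```

The same check can be run on the real `df_inputs_prepr` columns before modeling to catch gaps or overlaps in the coarse classing.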

Variable: 'out_prncp'¶

In [1390]:
# unique values
df_inputs_prepr['out_prncp'].unique() 
Out[1390]:
array([    0.  , 17331.34,   357.91, ..., 13382.28, 24725.78, 29443.26])
In [1391]:
df_inputs_prepr.loc[df_inputs_prepr['out_prncp'] == 0, : ]['out_prncp'].count()
Out[1391]:
269105
In [1392]:
# A separate category will be created for 'out_prncp' = 0.
#********************************
# 'out_prncp'
# Fine-classing is applied only to observations with 'out_prncp' > 0.
df_inputs_prepr_temp = df_inputs_prepr.loc[df_inputs_prepr['out_prncp'] != 0, : ].copy()

# Here we do fine-classing: using the 'cut' method, we split the variable into 50 equal-width intervals.
df_inputs_prepr_temp['out_prncp_factor'] = pd.cut(df_inputs_prepr_temp['out_prncp'], 50)

# We calculate weight of evidence.
df_temp = woe_ordered_continuous(df_inputs_prepr_temp, 'out_prncp_factor', df_targets_prepr[df_inputs_prepr_temp.index])
df_temp
Out[1392]:
out_prncp_factor n_obs prop_good prop_n_obs n_good n_bad prop_n_good prop_n_bad WoE diff_prop_good diff_WoE IV
0 (-29.798, 791.883] 131 1.0 0.025541 131.0 0.0 0.025541 NaN NaN NaN NaN 0.0
1 (791.883, 1574.435] 223 1.0 0.043478 223.0 0.0 0.043478 NaN NaN 0.0 NaN 0.0
2 (1574.435, 2356.988] 251 1.0 0.048937 251.0 0.0 0.048937 NaN NaN 0.0 NaN 0.0
3 (2356.988, 3139.54] 218 1.0 0.042503 218.0 0.0 0.042503 NaN NaN 0.0 NaN 0.0
4 (3139.54, 3922.093] 226 1.0 0.044063 226.0 0.0 0.044063 NaN NaN 0.0 NaN 0.0
5 (3922.093, 4704.646] 266 1.0 0.051862 266.0 0.0 0.051862 NaN NaN 0.0 NaN 0.0
6 (4704.646, 5487.198] 221 1.0 0.043088 221.0 0.0 0.043088 NaN NaN 0.0 NaN 0.0
7 (5487.198, 6269.751] 212 1.0 0.041334 212.0 0.0 0.041334 NaN NaN 0.0 NaN 0.0
8 (6269.751, 7052.303] 193 1.0 0.037629 193.0 0.0 0.037629 NaN NaN 0.0 NaN 0.0
9 (7052.303, 7834.856] 191 1.0 0.037239 191.0 0.0 0.037239 NaN NaN 0.0 NaN 0.0
10 (7834.856, 8617.409] 202 1.0 0.039384 202.0 0.0 0.039384 NaN NaN 0.0 NaN 0.0
11 (8617.409, 9399.961] 237 1.0 0.046208 237.0 0.0 0.046208 NaN NaN 0.0 NaN 0.0
12 (9399.961, 10182.514] 201 1.0 0.039189 201.0 0.0 0.039189 NaN NaN 0.0 NaN 0.0
13 (10182.514, 10965.066] 154 1.0 0.030025 154.0 0.0 0.030025 NaN NaN 0.0 NaN 0.0
14 (10965.066, 11747.619] 159 1.0 0.031000 159.0 0.0 0.031000 NaN NaN 0.0 NaN 0.0
15 (11747.619, 12530.172] 161 1.0 0.031390 161.0 0.0 0.031390 NaN NaN 0.0 NaN 0.0
16 (12530.172, 13312.724] 131 1.0 0.025541 131.0 0.0 0.025541 NaN NaN 0.0 NaN 0.0
17 (13312.724, 14095.277] 157 1.0 0.030610 157.0 0.0 0.030610 NaN NaN 0.0 NaN 0.0
18 (14095.277, 14877.829] 108 1.0 0.021057 108.0 0.0 0.021057 NaN NaN 0.0 NaN 0.0
19 (14877.829, 15660.382] 104 1.0 0.020277 104.0 0.0 0.020277 NaN NaN 0.0 NaN 0.0
20 (15660.382, 16442.935] 99 1.0 0.019302 99.0 0.0 0.019302 NaN NaN 0.0 NaN 0.0
21 (16442.935, 17225.487] 109 1.0 0.021252 109.0 0.0 0.021252 NaN NaN 0.0 NaN 0.0
22 (17225.487, 18008.04] 97 1.0 0.018912 97.0 0.0 0.018912 NaN NaN 0.0 NaN 0.0
23 (18008.04, 18790.592] 99 1.0 0.019302 99.0 0.0 0.019302 NaN NaN 0.0 NaN 0.0
24 (18790.592, 19573.145] 86 1.0 0.016767 86.0 0.0 0.016767 NaN NaN 0.0 NaN 0.0
25 (19573.145, 20355.698] 60 1.0 0.011698 60.0 0.0 0.011698 NaN NaN 0.0 NaN 0.0
26 (20355.698, 21138.25] 70 1.0 0.013648 70.0 0.0 0.013648 NaN NaN 0.0 NaN 0.0
27 (21138.25, 21920.803] 68 1.0 0.013258 68.0 0.0 0.013258 NaN NaN 0.0 NaN 0.0
28 (21920.803, 22703.355] 66 1.0 0.012868 66.0 0.0 0.012868 NaN NaN 0.0 NaN 0.0
29 (22703.355, 23485.908] 53 1.0 0.010333 53.0 0.0 0.010333 NaN NaN 0.0 NaN 0.0
30 (23485.908, 24268.461] 52 1.0 0.010138 52.0 0.0 0.010138 NaN NaN 0.0 NaN 0.0
31 (24268.461, 25051.013] 59 1.0 0.011503 59.0 0.0 0.011503 NaN NaN 0.0 NaN 0.0
32 (25051.013, 25833.566] 29 1.0 0.005654 29.0 0.0 0.005654 NaN NaN 0.0 NaN 0.0
33 (25833.566, 26616.118] 45 1.0 0.008774 45.0 0.0 0.008774 NaN NaN 0.0 NaN 0.0
34 (26616.118, 27398.671] 57 1.0 0.011113 57.0 0.0 0.011113 NaN NaN 0.0 NaN 0.0
35 (27398.671, 28181.224] 42 1.0 0.008189 42.0 0.0 0.008189 NaN NaN 0.0 NaN 0.0
36 (28181.224, 28963.776] 42 1.0 0.008189 42.0 0.0 0.008189 NaN NaN 0.0 NaN 0.0
37 (28963.776, 29746.329] 43 1.0 0.008384 43.0 0.0 0.008384 NaN NaN 0.0 NaN 0.0
38 (29746.329, 30528.881] 24 1.0 0.004679 24.0 0.0 0.004679 NaN NaN 0.0 NaN 0.0
39 (30528.881, 31311.434] 28 1.0 0.005459 28.0 0.0 0.005459 NaN NaN 0.0 NaN 0.0
40 (31311.434, 32093.987] 23 1.0 0.004484 23.0 0.0 0.004484 NaN NaN 0.0 NaN 0.0
41 (32093.987, 32876.539] 20 1.0 0.003899 20.0 0.0 0.003899 NaN NaN 0.0 NaN 0.0
42 (32876.539, 33659.092] 34 1.0 0.006629 34.0 0.0 0.006629 NaN NaN 0.0 NaN 0.0
43 (33659.092, 34441.644] 23 1.0 0.004484 23.0 0.0 0.004484 NaN NaN 0.0 NaN 0.0
44 (34441.644, 35224.197] 10 1.0 0.001950 10.0 0.0 0.001950 NaN NaN 0.0 NaN 0.0
45 (35224.197, 36006.75] 10 1.0 0.001950 10.0 0.0 0.001950 NaN NaN 0.0 NaN 0.0
46 (36006.75, 36789.302] 8 1.0 0.001560 8.0 0.0 0.001560 NaN NaN 0.0 NaN 0.0
47 (36789.302, 37571.855] 14 1.0 0.002730 14.0 0.0 0.002730 NaN NaN 0.0 NaN 0.0
48 (37571.855, 38354.407] 7 1.0 0.001365 7.0 0.0 0.001365 NaN NaN 0.0 NaN 0.0
49 (38354.407, 39136.96] 6 1.0 0.001170 6.0 0.0 0.001170 NaN NaN 0.0 NaN 0.0
In [1393]:
plot_by_woe(df_temp.iloc[: 50, : ], 90)
# We plot the weight of evidence values.
[Figure: weight of evidence by 'out_prncp' fine class, x-axis labels rotated 90°]
In [1395]:
# Categories: '=0', '>0'
df_inputs_prepr['out_prncp:=0'] = np.where((df_inputs_prepr['out_prncp'] == 0.), 1, 0)
df_inputs_prepr['out_prncp:>0'] = np.where((df_inputs_prepr['out_prncp'] > 0.), 1, 0)

Variable: 'last_pymnt_amnt'¶

In [1396]:
# number of unique values
df_inputs_prepr['last_pymnt_amnt'].nunique() 
Out[1396]:
193962
In [1397]:
df_inputs_prepr['last_pymnt_amnt'].max()
Out[1397]:
42148.53
In [1398]:
# A separate category will be created for 'last_pymnt_amnt' > 10000.
#********************************
# 'last_pymnt_amnt'
# Fine-classing is applied only to observations with 'last_pymnt_amnt' <= 10000.
df_inputs_prepr_temp = df_inputs_prepr.loc[df_inputs_prepr['last_pymnt_amnt'] <= 10000., : ].copy()

# Here we do fine-classing: using the 'cut' method, we split the variable into 50 equal-width intervals.
df_inputs_prepr_temp['last_pymnt_amnt_factor'] = pd.cut(df_inputs_prepr_temp['last_pymnt_amnt'], 50)

# We calculate weight of evidence.
df_temp = woe_ordered_continuous(df_inputs_prepr_temp, 'last_pymnt_amnt_factor', df_targets_prepr[df_inputs_prepr_temp.index])
df_temp
Out[1398]:
last_pymnt_amnt_factor n_obs prop_good prop_n_obs n_good n_bad prop_n_good prop_n_bad WoE diff_prop_good diff_WoE IV
0 (-10.0, 200.0] 30839 0.326859 0.140436 10080.0 20759.0 0.171572 0.129064 0.845591 NaN NaN 0.705538
1 (200.0, 400.0] 39046 0.510885 0.177810 19948.0 19098.0 0.339535 0.118737 1.350552 0.184026 0.504960 0.705538
2 (400.0, 600.0] 24991 0.548037 0.113805 13696.0 11295.0 0.233119 0.070224 1.463178 0.037153 0.112626 0.705538
3 (600.0, 800.0] 15988 0.472167 0.072807 7549.0 8439.0 0.128491 0.052467 1.238079 0.075871 0.225099 0.705538
4 (800.0, 1000.0] 9179 0.445364 0.041800 4088.0 5091.0 0.069582 0.031652 1.162632 0.026802 0.075447 0.705538
5 (1000.0, 1200.0] 5887 0.316460 0.026809 1863.0 4024.0 0.031710 0.025018 0.818670 0.128904 0.343962 0.705538
6 (1200.0, 1400.0] 4087 0.232934 0.018612 952.0 3135.0 0.016204 0.019491 0.605056 0.083526 0.213614 0.705538
7 (1400.0, 1600.0] 3053 0.054045 0.013903 165.0 2888.0 0.002808 0.017955 0.145323 0.178888 0.459733 0.705538
8 (1600.0, 1800.0] 2748 0.035298 0.012514 97.0 2651.0 0.001651 0.016482 0.095467 0.018747 0.049856 0.705538
9 (1800.0, 2000.0] 2905 0.026506 0.013229 77.0 2828.0 0.001311 0.017582 0.071894 0.008792 0.023573 0.705538
10 (2000.0, 2200.0] 2630 0.018251 0.011977 48.0 2582.0 0.000817 0.016053 0.049642 0.008255 0.022252 0.705538
11 (2200.0, 2400.0] 2827 0.013442 0.012874 38.0 2789.0 0.000647 0.017340 0.036622 0.004809 0.013020 0.705538
12 (2400.0, 2600.0] 2846 0.008784 0.012960 25.0 2821.0 0.000426 0.017539 0.023972 0.004658 0.012650 0.705538
13 (2600.0, 2800.0] 2566 0.008574 0.011685 22.0 2544.0 0.000374 0.015817 0.023399 0.000211 0.000573 0.705538
14 (2800.0, 3000.0] 2587 0.004639 0.011781 12.0 2575.0 0.000204 0.016009 0.012678 0.003935 0.010722 0.705538
15 (3000.0, 3200.0] 2565 0.002729 0.011681 7.0 2558.0 0.000119 0.015904 0.007464 0.001910 0.005214 0.705538
16 (3200.0, 3400.0] 2455 0.001629 0.011180 4.0 2451.0 0.000068 0.015238 0.004458 0.001100 0.003006 0.705538
17 (3400.0, 3600.0] 2479 0.002824 0.011289 7.0 2472.0 0.000119 0.015369 0.007723 0.001194 0.003265 0.705538
18 (3600.0, 3800.0] 2603 0.001921 0.011854 5.0 2598.0 0.000085 0.016152 0.005255 0.000903 0.002467 0.705538
19 (3800.0, 4000.0] 2471 0.004856 0.011253 12.0 2459.0 0.000204 0.015288 0.013272 0.002935 0.008017 0.705538
20 (4000.0, 4200.0] 2523 0.002378 0.011489 6.0 2517.0 0.000102 0.015649 0.006505 0.002478 0.006767 0.705538
21 (4200.0, 4400.0] 2257 0.002215 0.010278 5.0 2252.0 0.000085 0.014001 0.006060 0.000163 0.000445 0.705538
22 (4400.0, 4600.0] 2299 0.000870 0.010469 2.0 2297.0 0.000034 0.014281 0.002381 0.001345 0.003679 0.705538
23 (4600.0, 4800.0] 2246 0.001781 0.010228 4.0 2242.0 0.000068 0.013939 0.004873 0.000911 0.002492 0.705538
24 (4800.0, 5000.0] 2361 0.001271 0.010752 3.0 2358.0 0.000051 0.014660 0.003477 0.000510 0.001395 0.705538
25 (5000.0, 5200.0] 2346 0.000426 0.010683 1.0 2345.0 0.000017 0.014579 0.001167 0.000844 0.002310 0.705538
26 (5200.0, 5400.0] 2027 0.001480 0.009231 3.0 2024.0 0.000051 0.012584 0.004050 0.001054 0.002883 0.705538
27 (5400.0, 5600.0] 2067 0.000484 0.009413 1.0 2066.0 0.000017 0.012845 0.001324 0.000996 0.002725 0.705538
28 (5600.0, 5800.0] 1971 0.000507 0.008976 1.0 1970.0 0.000017 0.012248 0.001389 0.000024 0.000064 0.705538
29 (5800.0, 6000.0] 2057 0.000486 0.009367 1.0 2056.0 0.000017 0.012783 0.001331 0.000021 0.000058 0.705538
30 (6000.0, 6200.0] 2054 0.000974 0.009354 2.0 2052.0 0.000034 0.012758 0.002665 0.000488 0.001334 0.705538
31 (6200.0, 6400.0] 1793 0.001673 0.008165 3.0 1790.0 0.000051 0.011129 0.004578 0.000699 0.001913 0.705538
32 (6400.0, 6600.0] 1756 0.001139 0.007997 2.0 1754.0 0.000034 0.010905 0.003117 0.000534 0.001461 0.705538
33 (6600.0, 6800.0] 1817 0.001101 0.008274 2.0 1815.0 0.000034 0.011284 0.003012 0.000038 0.000105 0.705538
34 (6800.0, 7000.0] 1786 0.001680 0.008133 3.0 1783.0 0.000051 0.011085 0.004596 0.000579 0.001584 0.705538
35 (7000.0, 7200.0] 1875 0.000000 0.008538 0.0 1875.0 0.000000 0.011657 0.000000 0.001680 0.004596 0.705538
36 (7200.0, 7400.0] 1702 0.001175 0.007751 2.0 1700.0 0.000034 0.010569 0.003216 0.001175 0.003216 0.705538
37 (7400.0, 7600.0] 1708 0.001756 0.007778 3.0 1705.0 0.000051 0.010600 0.004806 0.000581 0.001590 0.705538
38 (7600.0, 7800.0] 1551 0.000000 0.007063 0.0 1551.0 0.000000 0.009643 0.000000 0.001756 0.004806 0.705538
39 (7800.0, 8000.0] 1564 0.000639 0.007122 1.0 1563.0 0.000017 0.009718 0.001750 0.000639 0.001750 0.705538
40 (8000.0, 8200.0] 1742 0.000574 0.007933 1.0 1741.0 0.000017 0.010824 0.001571 0.000065 0.000179 0.705538
41 (8200.0, 8400.0] 1562 0.000640 0.007113 1.0 1561.0 0.000017 0.009705 0.001752 0.000066 0.000181 0.705538
42 (8400.0, 8600.0] 1593 0.000628 0.007254 1.0 1592.0 0.000017 0.009898 0.001718 0.000012 0.000034 0.705538
43 (8600.0, 8800.0] 1431 0.001398 0.006517 2.0 1429.0 0.000034 0.008884 0.003824 0.000770 0.002106 0.705538
44 (8800.0, 9000.0] 1455 0.000687 0.006626 1.0 1454.0 0.000017 0.009040 0.001881 0.000710 0.001943 0.705538
45 (9000.0, 9200.0] 1495 0.000669 0.006808 1.0 1494.0 0.000017 0.009289 0.001831 0.000018 0.000050 0.705538
46 (9200.0, 9400.0] 1458 0.000000 0.006640 0.0 1458.0 0.000000 0.009065 0.000000 0.000669 0.001831 0.705538
47 (9400.0, 9600.0] 1419 0.000705 0.006462 1.0 1418.0 0.000017 0.008816 0.001929 0.000705 0.001929 0.705538
48 (9600.0, 9800.0] 1462 0.000000 0.006658 0.0 1462.0 0.000000 0.009090 0.000000 0.000705 0.001929 0.705538
49 (9800.0, 10000.0] 1465 0.002048 0.006671 3.0 1462.0 0.000051 0.009090 0.005602 0.002048 0.005602 0.705538
In [1399]:
#plot_by_woe(df_temp.iloc[7: , : ], 90)
plot_by_woe(df_temp, 90)
# We plot the weight of evidence values.
[Figure: weight of evidence by 'last_pymnt_amnt' fine class, x-axis labels rotated 90°]
In [1402]:
# Categories: '<=200', '200-700', '700-1000', '1000-1500', '1500-2600', '2600-10000', '>10000'
df_inputs_prepr['last_pymnt_amnt:<=200'] = np.where((df_inputs_prepr['last_pymnt_amnt'] <= 200), 1, 0)
df_inputs_prepr['last_pymnt_amnt:200-700'] = np.where((df_inputs_prepr['last_pymnt_amnt'] > 200) & (df_inputs_prepr['last_pymnt_amnt'] <= 700), 1, 0)
df_inputs_prepr['last_pymnt_amnt:700-1000'] = np.where((df_inputs_prepr['last_pymnt_amnt'] > 700) & (df_inputs_prepr['last_pymnt_amnt'] <= 1000), 1, 0)
df_inputs_prepr['last_pymnt_amnt:1000-1500'] = np.where((df_inputs_prepr['last_pymnt_amnt'] > 1000) & (df_inputs_prepr['last_pymnt_amnt'] <= 1500), 1, 0)
df_inputs_prepr['last_pymnt_amnt:1500-2600'] = np.where((df_inputs_prepr['last_pymnt_amnt'] > 1500) & (df_inputs_prepr['last_pymnt_amnt'] <= 2600), 1, 0)
df_inputs_prepr['last_pymnt_amnt:2600-10000'] = np.where((df_inputs_prepr['last_pymnt_amnt'] > 2600) & (df_inputs_prepr['last_pymnt_amnt'] <= 10000), 1, 0)
df_inputs_prepr['last_pymnt_amnt:>10000'] = np.where((df_inputs_prepr['last_pymnt_amnt'] > 10000), 1, 0)
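The chain of `np.where` conditions above can also be expressed with `pd.cut` over explicit edges followed by one-hot encoding; a sketch on hypothetical toy amounts:

```python
import numpy as np
import pandas as pd

# Toy payment amounts (hypothetical, for illustration only).
amounts = pd.Series([150., 650., 1200., 5000., 20000.], name='last_pymnt_amnt')

# Right-closed edges matching the coarse classes defined above.
edges = [-np.inf, 200, 700, 1000, 1500, 2600, 10000, np.inf]
labels = ['<=200', '200-700', '700-1000', '1000-1500',
          '1500-2600', '2600-10000', '>10000']
coarse = pd.cut(amounts, bins=edges, labels=labels)

# One-hot encode with the same 'variable:category' naming convention.
dummies = pd.get_dummies(coarse, prefix='last_pymnt_amnt',
                         prefix_sep=':').astype(int)
```

`pd.cut` guarantees the intervals are right-closed and non-overlapping, which removes the risk of a boundary value landing in two dummy columns at once.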

Variable: 'principal_paid_ratio'¶

In [1403]:
df_inputs_prepr['principal_paid_ratio'].nunique()
Out[1403]:
54695
In [1404]:
df_inputs_prepr.loc[df_inputs_prepr['principal_paid_ratio'] >= 1., : ]['principal_paid_ratio'].count()
Out[1404]:
214732
In [1405]:
# A separate category will be created for 'principal_paid_ratio' >= 1 (214732 observations).
#********************************
# 'principal_paid_ratio'
# Fine-classing is applied only to observations with 'principal_paid_ratio' < 1.
df_inputs_prepr_temp = df_inputs_prepr.loc[df_inputs_prepr['principal_paid_ratio'] < 1., : ].copy()

# Here we do fine-classing: using the 'cut' method, we split the variable into 20 equal-width intervals.
df_inputs_prepr_temp['principal_paid_ratio_factor'] = pd.cut(df_inputs_prepr_temp['principal_paid_ratio'], 20)

# We calculate weight of evidence.
df_temp = woe_ordered_continuous(df_inputs_prepr_temp, 'principal_paid_ratio_factor', df_targets_prepr[df_inputs_prepr_temp.index])
df_temp
Out[1405]:
principal_paid_ratio_factor n_obs prop_good prop_n_obs n_good n_bad prop_n_good prop_n_bad WoE diff_prop_good diff_WoE IV
0 (-0.001, 0.05] 4479 1.000000 0.075275 4479.0 0.0 0.076193 0.000000 inf NaN NaN inf
1 (0.05, 0.1] 6358 1.000000 0.106854 6358.0 0.0 0.108157 0.000000 inf 0.000000 NaN inf
2 (0.1, 0.15] 6557 1.000000 0.110198 6557.0 0.0 0.111542 0.000000 inf 0.000000 NaN inf
3 (0.15, 0.2] 5993 0.999833 0.100719 5992.0 1.0 0.101931 0.001395 4.305204 0.000167 inf inf
4 (0.2, 0.25] 5382 0.999814 0.090451 5381.0 1.0 0.091537 0.001395 4.199185 0.000019 0.106020 inf
5 (0.25, 0.3] 4927 1.000000 0.082804 4927.0 0.0 0.083814 0.000000 inf 0.000186 inf inf
6 (0.3, 0.35] 4240 1.000000 0.071258 4240.0 0.0 0.072127 0.000000 inf 0.000000 NaN inf
7 (0.35, 0.4] 3571 1.000000 0.060015 3571.0 0.0 0.060747 0.000000 inf 0.000000 NaN inf
8 (0.4, 0.45] 3241 0.999383 0.054469 3239.0 2.0 0.055099 0.002789 3.032692 0.000617 inf inf
9 (0.45, 0.5] 2611 0.999234 0.043881 2609.0 2.0 0.044382 0.002789 2.827963 0.000149 0.204729 inf
10 (0.5, 0.55] 2332 0.996141 0.039192 2323.0 9.0 0.039517 0.012552 1.422669 0.003093 1.405293 inf
11 (0.55, 0.6] 1908 0.998428 0.032066 1905.0 3.0 0.032406 0.004184 2.168492 0.002287 0.745823 inf
12 (0.6, 0.65] 1744 0.934060 0.029310 1629.0 115.0 0.027711 0.160391 0.159371 0.064368 2.009121 inf
13 (0.65, 0.7] 1528 0.958770 0.025680 1465.0 63.0 0.024921 0.087866 0.249691 0.024710 0.090320 inf
14 (0.7, 0.75] 1234 0.960292 0.020739 1185.0 49.0 0.020158 0.068340 0.258486 0.001522 0.008795 inf
15 (0.75, 0.8] 889 0.970754 0.014941 863.0 26.0 0.014681 0.036262 0.339928 0.010462 0.081442 inf
16 (0.8, 0.85] 813 0.977860 0.013663 795.0 18.0 0.013524 0.025105 0.430938 0.007106 0.091010 inf
17 (0.85, 0.9] 573 0.963351 0.009630 552.0 21.0 0.009390 0.029289 0.278091 0.014509 0.152847 inf
18 (0.9, 0.95] 486 0.979424 0.008168 476.0 10.0 0.008097 0.013947 0.457790 0.016073 0.179699 inf
19 (0.95, 1.0] 636 0.375786 0.010689 239.0 397.0 0.004066 0.553696 0.007316 0.603638 0.450474 inf
In [1406]:
plot_by_woe(df_temp, 90)
# We plot the weight of evidence values.
[Figure: weight of evidence by 'principal_paid_ratio' fine class, x-axis labels rotated 90°]
In [1408]:
# Categories: '<=0.3', '0.3-0.45', '0.45-0.6', '0.6-1', '=1' (i.e. >= 1)
df_inputs_prepr['principal_paid_ratio:<=0.3'] = np.where((df_inputs_prepr['principal_paid_ratio'] <= 0.3), 1, 0)
df_inputs_prepr['principal_paid_ratio:0.3-0.45'] = np.where((df_inputs_prepr['principal_paid_ratio'] > 0.3) & (df_inputs_prepr['principal_paid_ratio'] <= 0.45), 1, 0)
df_inputs_prepr['principal_paid_ratio:0.45-0.6'] = np.where((df_inputs_prepr['principal_paid_ratio'] > 0.45) & (df_inputs_prepr['principal_paid_ratio'] <= 0.6), 1, 0)
# Strict upper bound so that ratios of exactly 1 fall only in the '=1' category.
df_inputs_prepr['principal_paid_ratio:0.6-1'] = np.where((df_inputs_prepr['principal_paid_ratio'] > 0.6) & (df_inputs_prepr['principal_paid_ratio'] < 1.), 1, 0)
df_inputs_prepr['principal_paid_ratio:=1'] = np.where((df_inputs_prepr['principal_paid_ratio'] >= 1.), 1, 0)

Variable: 'fico_range_high'¶

In [1409]:
df_inputs_prepr['fico_range_high'].unique()
Out[1409]:
array([679., 699., 664., 709., 684., 704., 714., 674., 739., 754., 694.,
       749., 779., 669., 784., 689., 719., 724., 729., 819., 794., 769.,
       774., 734., 744., 789., 759., 764., 809., 814., 799., 824., 834.,
       804., 829., 844., 839., 850.])
In [1410]:
# 'fico_range_high'
df_temp = woe_ordered_continuous(df_inputs_prepr, 'fico_range_high', df_targets_prepr)

# We calculate weight of evidence.
df_temp
Out[1410]:
fico_range_high n_obs prop_good prop_n_obs n_good n_bad prop_n_good prop_n_bad WoE diff_prop_good diff_WoE IV
0 664.0 24699 0.282238 0.090065 6971.0 17728.0 0.118550 0.082290 0.892258 NaN NaN 0.051934
1 669.0 23877 0.275495 0.087068 6578.0 17299.0 0.111867 0.080299 0.872601 0.006743 0.019656 0.051934
2 674.0 23894 0.261781 0.087130 6255.0 17639.0 0.106374 0.081877 0.832555 0.013714 0.040046 0.051934
3 679.0 21241 0.250741 0.077456 5326.0 15915.0 0.090575 0.073875 0.800234 0.011040 0.032321 0.051934
4 684.0 21156 0.243288 0.077146 5147.0 16009.0 0.087531 0.074311 0.778361 0.007454 0.021874 0.051934
5 689.0 18350 0.231281 0.066914 4244.0 14106.0 0.072174 0.065478 0.743020 0.012007 0.035341 0.051934
6 694.0 17828 0.222796 0.065010 3972.0 13856.0 0.067549 0.064317 0.717958 0.008485 0.025062 0.051934
7 699.0 16181 0.219640 0.059004 3554.0 12627.0 0.060440 0.058612 0.708618 0.003155 0.009340 0.051934
8 704.0 14750 0.201627 0.053786 2974.0 11776.0 0.050577 0.054662 0.655058 0.018013 0.053560 0.051934
9 709.0 13447 0.192459 0.049035 2588.0 10859.0 0.044012 0.050406 0.627625 0.009168 0.027433 0.051934
10 714.0 11568 0.174187 0.042183 2015.0 9553.0 0.034268 0.044343 0.572546 0.018272 0.055079 0.051934
11 719.0 10455 0.164610 0.038124 1721.0 8734.0 0.029268 0.040542 0.543437 0.009577 0.029110 0.051934
12 724.0 8922 0.160278 0.032534 1430.0 7492.0 0.024319 0.034777 0.530210 0.004332 0.013227 0.051934
13 729.0 6979 0.160768 0.025449 1122.0 5857.0 0.019081 0.027187 0.531708 0.000490 0.001498 0.051934
14 734.0 6186 0.150824 0.022557 933.0 5253.0 0.015867 0.024384 0.501210 0.009944 0.030498 0.051934
15 739.0 4915 0.143235 0.017923 704.0 4211.0 0.011972 0.019547 0.477785 0.007589 0.023425 0.051934
16 744.0 4417 0.132443 0.016107 585.0 3832.0 0.009949 0.017788 0.444240 0.010792 0.033545 0.051934
17 749.0 3528 0.118197 0.012865 417.0 3111.0 0.007092 0.014441 0.399502 0.014246 0.044738 0.051934
18 754.0 3296 0.119842 0.012019 395.0 2901.0 0.006717 0.013466 0.404696 0.001645 0.005194 0.051934
19 759.0 2772 0.128788 0.010108 357.0 2415.0 0.006071 0.011210 0.432813 0.008946 0.028117 0.051934
20 764.0 2270 0.117181 0.008278 266.0 2004.0 0.004524 0.009302 0.396288 0.011607 0.036525 0.051934
21 769.0 2082 0.109030 0.007592 227.0 1855.0 0.003860 0.008611 0.370413 0.008151 0.025875 0.051934
22 774.0 1914 0.115987 0.006979 222.0 1692.0 0.003775 0.007854 0.392512 0.006958 0.022100 0.051934
23 779.0 1664 0.100962 0.006068 168.0 1496.0 0.002857 0.006944 0.344603 0.015026 0.047909 0.051934
24 784.0 1505 0.087043 0.005488 131.0 1374.0 0.002228 0.006378 0.299588 0.013918 0.045015 0.051934
25 789.0 1246 0.079454 0.004544 99.0 1147.0 0.001684 0.005324 0.274764 0.007589 0.024824 0.051934
26 794.0 1061 0.071631 0.003869 76.0 985.0 0.001292 0.004572 0.248952 0.007824 0.025812 0.051934
27 799.0 878 0.079727 0.003202 70.0 808.0 0.001190 0.003751 0.275659 0.008096 0.026707 0.051934
28 804.0 773 0.087969 0.002819 68.0 705.0 0.001156 0.003272 0.302603 0.008242 0.026944 0.051934
29 809.0 690 0.088406 0.002516 61.0 629.0 0.001037 0.002920 0.304024 0.000437 0.001421 0.051934
30 814.0 477 0.056604 0.001739 27.0 450.0 0.000459 0.002089 0.198704 0.031802 0.105320 0.051934
31 819.0 397 0.080605 0.001448 32.0 365.0 0.000544 0.001694 0.278540 0.024001 0.079836 0.051934
32 824.0 302 0.096026 0.001101 29.0 273.0 0.000493 0.001267 0.328716 0.015422 0.050175 0.051934
33 829.0 217 0.082949 0.000791 18.0 199.0 0.000306 0.000924 0.286222 0.013077 0.042493 0.051934
34 834.0 122 0.081967 0.000445 10.0 112.0 0.000170 0.000520 0.283007 0.000982 0.003215 0.051934
35 839.0 75 0.013333 0.000273 1.0 74.0 0.000017 0.000343 0.048323 0.068634 0.234685 0.051934
36 844.0 54 0.055556 0.000197 3.0 51.0 0.000051 0.000237 0.195164 0.042222 0.146842 0.051934
37 850.0 46 0.130435 0.000168 6.0 40.0 0.000102 0.000186 0.437966 0.074879 0.242802 0.051934
In [1411]:
plot_by_woe(df_temp.iloc[2: , : ], 90)
#plot_by_woe(df_temp, 90)
# We plot the weight of evidence values.
[Plot: weight of evidence by 'fico_range_high' category]
In [1413]:
# We create the following categories: '<= 680', '680-700', '700-720', '720-750', '750-795', '> 795'.
# '> 795' will be the reference category.
df_inputs_prepr['fico_range_high:<=680'] = np.where((df_inputs_prepr['fico_range_high'] <= 680), 1, 0)
df_inputs_prepr['fico_range_high:680-700'] = np.where((df_inputs_prepr['fico_range_high'] > 680) & (df_inputs_prepr['fico_range_high'] <= 700), 1, 0)
df_inputs_prepr['fico_range_high:700-720'] = np.where((df_inputs_prepr['fico_range_high'] > 700) & (df_inputs_prepr['fico_range_high'] <= 720), 1, 0)
df_inputs_prepr['fico_range_high:720-750'] = np.where((df_inputs_prepr['fico_range_high'] > 720) & (df_inputs_prepr['fico_range_high'] <= 750), 1, 0)
df_inputs_prepr['fico_range_high:750-795'] = np.where((df_inputs_prepr['fico_range_high'] > 750) & (df_inputs_prepr['fico_range_high'] <= 795), 1, 0)
df_inputs_prepr['fico_range_high:>795'] = np.where((df_inputs_prepr['fico_range_high'] > 795), 1, 0)

Variable: 'last_fico_range_high'¶

In [1414]:
df_inputs_prepr['last_fico_range_high'].unique()
Out[1414]:
array([574., 709., 714., 749., 664., 799., 589., 689., 499., 694., 579.,
       784., 774., 669., 554., 684., 529., 659., 734., 769., 739., 569.,
       834., 634., 594., 674., 614., 724., 789., 719., 729., 704., 544.,
       629., 584., 524., 599., 509., 699., 779., 619., 549., 744., 654.,
       649., 794., 754., 539., 624., 804., 759., 604., 519., 534., 679.,
       764., 644., 819., 609., 839., 559., 829., 639., 504., 564., 809.,
       814., 824., 514.,   0., 844., 850.])
In [1415]:
# 'last_fico_range_high'
df_temp = woe_ordered_continuous(df_inputs_prepr, 'last_fico_range_high', df_targets_prepr)

# We calculate weight of evidence.
df_temp
Out[1415]:
last_fico_range_high n_obs prop_good prop_n_obs n_good n_bad prop_n_good prop_n_bad WoE diff_prop_good diff_WoE IV
0 0.0 45 0.088889 0.000164 4.0 41.0 0.000068 0.000190 0.305595 NaN NaN 1.778145
1 499.0 7207 0.872485 0.026280 6288.0 919.0 0.106935 0.004266 3.260698 0.783596 2.955103 1.778145
2 504.0 1510 0.851656 0.005506 1286.0 224.0 0.021870 0.001040 3.092563 0.020829 0.168135 1.778145
3 509.0 1679 0.854080 0.006123 1434.0 245.0 0.024387 0.001137 3.111013 0.002424 0.018450 1.778145
4 514.0 1928 0.821577 0.007030 1584.0 344.0 0.026938 0.001597 2.883123 0.032503 0.227890 1.778145
5 519.0 1964 0.815173 0.007162 1601.0 363.0 0.027227 0.001685 2.842498 0.006404 0.040625 1.778145
6 524.0 2277 0.819060 0.008303 1865.0 412.0 0.031717 0.001912 2.867012 0.003887 0.024515 1.778145
7 529.0 2252 0.804618 0.008212 1812.0 440.0 0.030815 0.002042 2.778056 0.014442 0.088956 1.778145
8 534.0 2427 0.791100 0.008850 1920.0 507.0 0.032652 0.002353 2.699636 0.013518 0.078421 1.778145
9 539.0 2544 0.768475 0.009277 1955.0 589.0 0.033247 0.002734 2.577216 0.022625 0.122420 1.778145
10 544.0 2901 0.772148 0.010579 2240.0 661.0 0.038094 0.003068 2.596412 0.003673 0.019196 1.778145
11 549.0 2580 0.760465 0.009408 1962.0 618.0 0.033366 0.002869 2.536179 0.011682 0.060233 1.778145
12 554.0 2850 0.759649 0.010393 2165.0 685.0 0.036818 0.003180 2.532059 0.000816 0.004119 1.778145
13 559.0 2678 0.725915 0.009765 1944.0 734.0 0.033060 0.003407 2.370550 0.033734 0.161510 1.778145
14 564.0 2956 0.727673 0.010779 2151.0 805.0 0.036580 0.003737 2.378578 0.001758 0.008028 1.778145
15 569.0 2614 0.730298 0.009532 1909.0 705.0 0.032465 0.003272 2.390645 0.002626 0.012067 1.778145
16 574.0 2826 0.673036 0.010305 1902.0 924.0 0.032346 0.004289 2.144934 0.057262 0.245710 1.778145
17 579.0 2673 0.667789 0.009747 1785.0 888.0 0.030356 0.004122 2.123997 0.005247 0.020938 1.778145
18 584.0 2770 0.632852 0.010101 1753.0 1017.0 0.029812 0.004721 1.989938 0.034937 0.134058 1.778145
19 589.0 2553 0.624363 0.009310 1594.0 959.0 0.027108 0.004452 1.958627 0.008488 0.031311 1.778145
20 594.0 2772 0.613997 0.010108 1702.0 1070.0 0.028945 0.004967 1.920981 0.010366 0.037646 1.778145
21 599.0 2517 0.571712 0.009178 1439.0 1078.0 0.024472 0.005004 1.773354 0.042285 0.147627 1.778145
22 604.0 2636 0.554628 0.009612 1462.0 1174.0 0.024863 0.005450 1.716037 0.017084 0.057317 1.778145
23 609.0 2657 0.488521 0.009689 1298.0 1359.0 0.022074 0.006308 1.503908 0.066107 0.212129 1.778145
24 614.0 2692 0.495542 0.009816 1334.0 1358.0 0.022686 0.006304 1.525825 0.007021 0.021917 1.778145
25 619.0 2565 0.445614 0.009353 1143.0 1422.0 0.019438 0.006601 1.372414 0.049928 0.153411 1.778145
26 624.0 2845 0.387698 0.010374 1103.0 1742.0 0.018758 0.008086 1.199896 0.057916 0.172517 1.778145
27 629.0 2723 0.369078 0.009929 1005.0 1718.0 0.017091 0.007975 1.145239 0.018619 0.054658 1.778145
28 634.0 3175 0.319055 0.011578 1013.0 2162.0 0.017227 0.010036 0.999385 0.050023 0.145854 1.778145
29 639.0 3013 0.281779 0.010987 849.0 2164.0 0.014438 0.010045 0.890920 0.037276 0.108466 1.778145
30 644.0 3623 0.248413 0.013211 900.0 2723.0 0.015306 0.012640 0.793406 0.033366 0.097514 1.778145
31 649.0 3780 0.199471 0.013784 754.0 3026.0 0.012823 0.014046 0.648617 0.048942 0.144788 1.778145
32 654.0 4347 0.159650 0.015851 694.0 3653.0 0.011802 0.016957 0.528290 0.039821 0.120327 1.778145
33 659.0 4949 0.121237 0.018047 600.0 4349.0 0.010204 0.020187 0.409093 0.038414 0.119197 1.778145
34 664.0 5470 0.108775 0.019946 595.0 4875.0 0.010119 0.022629 0.369601 0.012461 0.039492 1.778145
35 669.0 5999 0.081014 0.021875 486.0 5513.0 0.008265 0.025590 0.279882 0.027762 0.089720 1.778145
36 674.0 6961 0.065508 0.025383 456.0 6505.0 0.007755 0.030195 0.228588 0.015506 0.051294 1.778145
37 679.0 7069 0.049370 0.025777 349.0 6720.0 0.005935 0.031193 0.174182 0.016137 0.054406 1.778145
38 684.0 8107 0.043296 0.029562 351.0 7756.0 0.005969 0.036002 0.153408 0.006075 0.020773 1.778145
39 689.0 8036 0.035590 0.029303 286.0 7750.0 0.004864 0.035974 0.126810 0.007706 0.026598 1.778145
40 694.0 8714 0.031099 0.031776 271.0 8443.0 0.004609 0.039191 0.111179 0.004490 0.015631 1.778145
41 699.0 8472 0.022073 0.030893 187.0 8285.0 0.003180 0.038458 0.079451 0.009027 0.031728 1.778145
42 704.0 8625 0.022377 0.031451 193.0 8432.0 0.003282 0.039140 0.080527 0.000304 0.001076 1.778145
43 709.0 8537 0.019913 0.031130 170.0 8367.0 0.002891 0.038838 0.071798 0.002463 0.008729 1.778145
44 714.0 8304 0.017100 0.030281 142.0 8162.0 0.002415 0.037887 0.061791 0.002813 0.010007 1.778145
45 719.0 8215 0.014973 0.029956 123.0 8092.0 0.002092 0.037562 0.054193 0.002128 0.007597 1.778145
46 724.0 8010 0.012110 0.029209 97.0 7913.0 0.001650 0.036731 0.043931 0.002863 0.010262 1.778145
47 729.0 7246 0.012007 0.026423 87.0 7159.0 0.001480 0.033231 0.043560 0.000103 0.000371 1.778145
48 734.0 7276 0.011545 0.026532 84.0 7192.0 0.001429 0.033384 0.041900 0.000462 0.001660 1.778145
49 739.0 6271 0.009727 0.022867 61.0 6210.0 0.001037 0.028826 0.035355 0.001817 0.006545 1.778145
50 744.0 5907 0.008295 0.021540 49.0 5858.0 0.000833 0.027192 0.030185 0.001432 0.005170 1.778145
51 749.0 5128 0.007995 0.018699 41.0 5087.0 0.000697 0.023613 0.029101 0.000300 0.001084 1.778145
52 754.0 5041 0.005356 0.018382 27.0 5014.0 0.000459 0.023274 0.019537 0.002639 0.009564 1.778145
53 759.0 4716 0.007846 0.017197 37.0 4679.0 0.000629 0.021719 0.028559 0.002490 0.009023 1.778145
54 764.0 4137 0.006768 0.015086 28.0 4109.0 0.000476 0.019073 0.024659 0.001077 0.003901 1.778145
55 769.0 3954 0.007334 0.014418 29.0 3925.0 0.000493 0.018219 0.026709 0.000566 0.002050 1.778145
56 774.0 3566 0.006169 0.013003 22.0 3544.0 0.000374 0.016451 0.022488 0.001165 0.004221 1.778145
57 779.0 3711 0.008084 0.013532 30.0 3681.0 0.000510 0.017087 0.029422 0.001915 0.006934 1.778145
58 784.0 3385 0.006204 0.012343 21.0 3364.0 0.000357 0.015615 0.022613 0.001880 0.006809 1.778145
59 789.0 2942 0.005438 0.010728 16.0 2926.0 0.000272 0.013582 0.019836 0.000765 0.002777 1.778145
60 794.0 2908 0.007909 0.010604 23.0 2885.0 0.000391 0.013392 0.028789 0.002471 0.008954 1.778145
61 799.0 2423 0.007429 0.008836 18.0 2405.0 0.000306 0.011164 0.027051 0.000480 0.001738 1.778145
62 804.0 2224 0.008094 0.008110 18.0 2206.0 0.000306 0.010240 0.029456 0.000665 0.002405 1.778145
63 809.0 2019 0.006439 0.007362 13.0 2006.0 0.000221 0.009312 0.023465 0.001655 0.005991 1.778145
64 814.0 1553 0.007083 0.005663 11.0 1542.0 0.000187 0.007158 0.025800 0.000644 0.002334 1.778145
65 819.0 1321 0.010598 0.004817 14.0 1307.0 0.000238 0.006067 0.038493 0.003515 0.012694 1.778145
66 824.0 922 0.003254 0.003362 3.0 919.0 0.000051 0.004266 0.011889 0.007344 0.026604 1.778145
67 829.0 729 0.008230 0.002658 6.0 723.0 0.000102 0.003356 0.029951 0.004977 0.018062 1.778145
68 834.0 423 0.009456 0.001542 4.0 419.0 0.000068 0.001945 0.034378 0.001226 0.004427 1.778145
69 839.0 216 0.009259 0.000788 2.0 214.0 0.000034 0.000993 0.033667 0.000197 0.000711 1.778145
70 844.0 115 0.000000 0.000419 0.0 115.0 0.000000 0.000534 0.000000 0.009259 0.033667 1.778145
71 850.0 54 0.037037 0.000197 2.0 52.0 0.000034 0.000241 0.131827 0.037037 0.131827 1.778145
In [1416]:
plot_by_woe(df_temp.iloc[1: 50, : ], 90)
#plot_by_woe(df_temp, 90)
# We plot the weight of evidence values.
[Plot: weight of evidence by 'last_fico_range_high' category]
In [1418]:
# We create the following categories: '<= 520', '520-550', '550-580', '580-610', '610-640', '640-670', '> 670'.
# '> 670' will be the reference category.
df_inputs_prepr['last_fico_range_high:<=520'] = np.where((df_inputs_prepr['last_fico_range_high'] <= 520), 1, 0)
df_inputs_prepr['last_fico_range_high:520-550'] = np.where((df_inputs_prepr['last_fico_range_high'] > 520) & (df_inputs_prepr['last_fico_range_high'] <= 550), 1, 0)
df_inputs_prepr['last_fico_range_high:550-580'] = np.where((df_inputs_prepr['last_fico_range_high'] > 550) & (df_inputs_prepr['last_fico_range_high'] <= 580), 1, 0)
df_inputs_prepr['last_fico_range_high:580-610'] = np.where((df_inputs_prepr['last_fico_range_high'] > 580) & (df_inputs_prepr['last_fico_range_high'] <= 610), 1, 0)
df_inputs_prepr['last_fico_range_high:610-640'] = np.where((df_inputs_prepr['last_fico_range_high'] > 610) & (df_inputs_prepr['last_fico_range_high'] <= 640), 1, 0)
df_inputs_prepr['last_fico_range_high:640-670'] = np.where((df_inputs_prepr['last_fico_range_high'] > 640) & (df_inputs_prepr['last_fico_range_high'] <= 670), 1, 0)
df_inputs_prepr['last_fico_range_high:>670'] = np.where((df_inputs_prepr['last_fico_range_high'] > 670), 1, 0)
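The `woe_ordered_continuous` helper used throughout is defined earlier in the notebook. As a rough, self-contained sketch of the per-bin quantities it reports, here is a simplified stand-in (the name `woe_table` is ours, and the notebook's helper may differ in details such as smoothing or ordering), using the common definitions WoE = ln(prop_n_good / prop_n_bad) and IV = Σ (prop_n_good − prop_n_bad) · WoE:

```python
import numpy as np
import pandas as pd

def woe_table(x_binned, y):
    # Simplified stand-in for the notebook's woe_ordered_continuous helper.
    # y follows the notebook's target convention: 1 = good, 0 = bad.
    df = pd.DataFrame({'bin': x_binned, 'good': y})
    grp = df.groupby('bin', observed=True)['good'].agg(n_obs='count', n_good='sum')
    grp['n_bad'] = grp['n_obs'] - grp['n_good']
    grp['prop_n_good'] = grp['n_good'] / grp['n_good'].sum()
    grp['prop_n_bad'] = grp['n_bad'] / grp['n_bad'].sum()
    # Bins with zero goods or zero bads would need smoothing; omitted here.
    grp['WoE'] = np.log(grp['prop_n_good'] / grp['prop_n_bad'])
    # Information value: one number summarising the variable's predictive power.
    grp['IV'] = ((grp['prop_n_good'] - grp['prop_n_bad']) * grp['WoE']).sum()
    return grp.reset_index()
```

Applied to a binned factor column and the corresponding targets, this produces a table with the same column names as the `df_temp` outputs above.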

Variable: 'mo_sin_rcnt_rev_tl_op'¶

In [1419]:
df_inputs_prepr['mo_sin_rcnt_rev_tl_op'].nunique()
Out[1419]:
224
In [1422]:
# 'mo_sin_rcnt_rev_tl_op'
# We keep the observations with 'mo_sin_rcnt_rev_tl_op' less than or equal to 150.
# We take a copy so that adding a column does not trigger a SettingWithCopyWarning.
df_inputs_prepr_temp = df_inputs_prepr.loc[df_inputs_prepr['mo_sin_rcnt_rev_tl_op'] <= 150., : ].copy()

# Here we do fine-classing: using the 'cut' method, we split the variable into 50 categories by its values.
df_inputs_prepr_temp['mo_sin_rcnt_rev_tl_op_factor'] = pd.cut(df_inputs_prepr_temp['mo_sin_rcnt_rev_tl_op'], 50)

df_temp = woe_ordered_continuous(df_inputs_prepr_temp, 'mo_sin_rcnt_rev_tl_op_factor', df_targets_prepr[df_inputs_prepr_temp.index])
# We calculate weight of evidence.
df_temp
Out[1422]:
mo_sin_rcnt_rev_tl_op_factor n_obs prop_good prop_n_obs n_good n_bad prop_n_good prop_n_bad WoE diff_prop_good diff_WoE IV
0 (-0.15, 3.0] 63052 0.247066 0.230036 15578.0 47474.0 0.265094 0.220469 0.789553 NaN NaN 0.011909
1 (3.0, 6.0] 48792 0.228234 0.178011 11136.0 37656.0 0.189504 0.174874 0.734125 0.018832 0.055428 0.011909
2 (6.0, 9.0] 49393 0.202822 0.180203 10018.0 39375.0 0.170479 0.182857 0.658713 0.025412 0.075412 0.011909
3 (9.0, 12.0] 26416 0.217633 0.096375 5749.0 20667.0 0.097832 0.095977 0.702763 0.014811 0.044049 0.011909
4 (12.0, 15.0] 19549 0.214282 0.071322 4189.0 15360.0 0.071285 0.071332 0.692821 0.003351 0.009942 0.011909
5 (15.0, 18.0] 13643 0.198197 0.049775 2704.0 10939.0 0.046015 0.050801 0.644895 0.016085 0.047925 0.011909
6 (18.0, 21.0] 10149 0.200414 0.037027 2034.0 8115.0 0.034613 0.037686 0.651522 0.002217 0.006627 0.011909
7 (21.0, 24.0] 8080 0.185767 0.029479 1501.0 6579.0 0.025543 0.030553 0.607602 0.014647 0.043921 0.011909
8 (24.0, 27.0] 6122 0.183110 0.022335 1121.0 5001.0 0.019076 0.023225 0.599596 0.002657 0.008005 0.011909
9 (27.0, 30.0] 4560 0.174561 0.016637 796.0 3764.0 0.013546 0.017480 0.573759 0.008549 0.025837 0.011909
10 (30.0, 33.0] 3609 0.178720 0.013167 645.0 2964.0 0.010976 0.013765 0.586344 0.004158 0.012585 0.011909
11 (33.0, 36.0] 2999 0.166722 0.010941 500.0 2499.0 0.008509 0.011605 0.549948 0.011998 0.036395 0.011909
12 (36.0, 39.0] 2478 0.173123 0.009041 429.0 2049.0 0.007300 0.009516 0.569400 0.006401 0.019452 0.011909
13 (39.0, 42.0] 1960 0.167347 0.007151 328.0 1632.0 0.005582 0.007579 0.551850 0.005777 0.017550 0.011909
14 (42.0, 45.0] 1642 0.163216 0.005991 268.0 1374.0 0.004561 0.006381 0.539259 0.004131 0.012591 0.011909
15 (45.0, 48.0] 1374 0.155750 0.005013 214.0 1160.0 0.003642 0.005387 0.516416 0.007466 0.022843 0.011909
16 (48.0, 51.0] 1191 0.161209 0.004345 192.0 999.0 0.003267 0.004639 0.533131 0.005459 0.016715 0.011909
17 (51.0, 54.0] 1013 0.151037 0.003696 153.0 860.0 0.002604 0.003994 0.501935 0.010173 0.031196 0.011909
18 (54.0, 57.0] 862 0.160093 0.003145 138.0 724.0 0.002348 0.003362 0.529718 0.009056 0.027784 0.011909
19 (57.0, 60.0] 766 0.159269 0.002795 122.0 644.0 0.002076 0.002991 0.527198 0.000824 0.002520 0.011909
20 (60.0, 63.0] 696 0.122126 0.002539 85.0 611.0 0.001446 0.002837 0.411958 0.037142 0.115240 0.011909
21 (63.0, 66.0] 619 0.150242 0.002258 93.0 526.0 0.001583 0.002443 0.499489 0.028116 0.087532 0.011909
22 (66.0, 69.0] 547 0.166362 0.001996 91.0 456.0 0.001549 0.002118 0.548851 0.016120 0.049362 0.011909
23 (69.0, 72.0] 548 0.135036 0.001999 74.0 474.0 0.001259 0.002201 0.452394 0.031325 0.096457 0.011909
24 (72.0, 75.0] 486 0.144033 0.001773 70.0 416.0 0.001191 0.001932 0.480324 0.008996 0.027929 0.011909
25 (75.0, 78.0] 443 0.155756 0.001616 69.0 374.0 0.001174 0.001737 0.516436 0.011723 0.036112 0.011909
26 (78.0, 81.0] 350 0.134286 0.001277 47.0 303.0 0.000800 0.001407 0.450055 0.021470 0.066381 0.011909
27 (81.0, 84.0] 334 0.113772 0.001219 38.0 296.0 0.000647 0.001375 0.385551 0.020513 0.064504 0.011909
28 (84.0, 87.0] 333 0.156156 0.001215 52.0 281.0 0.000885 0.001305 0.517663 0.042384 0.132112 0.011909
29 (87.0, 90.0] 294 0.142857 0.001073 42.0 252.0 0.000715 0.001170 0.476685 0.013299 0.040978 0.011909
30 (90.0, 93.0] 249 0.148594 0.000908 37.0 212.0 0.000630 0.000985 0.494412 0.005737 0.017727 0.011909
31 (93.0, 96.0] 221 0.167421 0.000806 37.0 184.0 0.000630 0.000854 0.552075 0.018826 0.057664 0.011909
32 (96.0, 99.0] 175 0.142857 0.000638 25.0 150.0 0.000425 0.000697 0.476685 0.024564 0.075390 0.011909
33 (99.0, 102.0] 171 0.116959 0.000624 20.0 151.0 0.000340 0.000701 0.395647 0.025898 0.081038 0.011909
34 (102.0, 105.0] 154 0.168831 0.000562 26.0 128.0 0.000442 0.000594 0.556366 0.051872 0.160719 0.011909
35 (105.0, 108.0] 146 0.164384 0.000533 24.0 122.0 0.000408 0.000567 0.542822 0.004448 0.013544 0.011909
36 (108.0, 111.0] 109 0.165138 0.000398 18.0 91.0 0.000306 0.000423 0.545121 0.000754 0.002299 0.011909
37 (111.0, 114.0] 91 0.120879 0.000332 11.0 80.0 0.000187 0.000372 0.408027 0.044258 0.137093 0.011909
38 (114.0, 117.0] 81 0.209877 0.000296 17.0 64.0 0.000289 0.000297 0.679729 0.088997 0.271702 0.011909
39 (117.0, 120.0] 67 0.149254 0.000244 10.0 57.0 0.000170 0.000265 0.496444 0.060623 0.183285 0.011909
40 (120.0, 123.0] 55 0.200000 0.000201 11.0 44.0 0.000187 0.000204 0.650286 0.050746 0.153842 0.011909
41 (123.0, 126.0] 50 0.200000 0.000182 10.0 40.0 0.000170 0.000186 0.650286 0.000000 0.000000 0.011909
42 (126.0, 129.0] 50 0.180000 0.000182 9.0 41.0 0.000153 0.000190 0.590212 0.020000 0.060074 0.011909
43 (129.0, 132.0] 33 0.151515 0.000120 5.0 28.0 0.000085 0.000130 0.503407 0.028485 0.086804 0.011909
44 (132.0, 135.0] 32 0.093750 0.000117 3.0 29.0 0.000051 0.000135 0.321410 0.057765 0.181997 0.011909
45 (135.0, 138.0] 31 0.290323 0.000113 9.0 22.0 0.000153 0.000102 0.915912 0.196573 0.594502 0.011909
46 (138.0, 141.0] 33 0.090909 0.000120 3.0 30.0 0.000051 0.000139 0.312205 0.199413 0.603707 0.011909
47 (141.0, 144.0] 16 0.437500 0.000058 7.0 9.0 0.000119 0.000042 1.348087 0.346591 1.035881 0.011909
48 (144.0, 147.0] 19 0.263158 0.000069 5.0 14.0 0.000085 0.000065 0.836683 0.174342 0.511403 0.011909
49 (147.0, 150.0] 13 0.076923 0.000047 1.0 12.0 0.000017 0.000056 0.266481 0.186235 0.570202 0.011909
In [1423]:
plot_by_woe(df_temp.iloc[13 : 47, : ], 90)
#plot_by_woe(df_temp, 90)
# We plot the weight of evidence values.
[Plot: weight of evidence by 'mo_sin_rcnt_rev_tl_op' category]
In [1425]:
# We create the following categories: '0-3', '3-6', '6-9', '9-20', '20-37', '37-63', '63-80', '80-140', '> 140'.
# '> 140' will be the reference category
df_inputs_prepr['mo_sin_rcnt_rev_tl_op:0-3'] = np.where((df_inputs_prepr['mo_sin_rcnt_rev_tl_op'] <= 3), 1, 0)
df_inputs_prepr['mo_sin_rcnt_rev_tl_op:3-6'] = np.where((df_inputs_prepr['mo_sin_rcnt_rev_tl_op'] > 3) & (df_inputs_prepr['mo_sin_rcnt_rev_tl_op'] <= 6), 1, 0)
df_inputs_prepr['mo_sin_rcnt_rev_tl_op:6-9'] = np.where((df_inputs_prepr['mo_sin_rcnt_rev_tl_op'] > 6) & (df_inputs_prepr['mo_sin_rcnt_rev_tl_op'] <= 9), 1, 0)
df_inputs_prepr['mo_sin_rcnt_rev_tl_op:9-20'] = np.where((df_inputs_prepr['mo_sin_rcnt_rev_tl_op'] > 9) & (df_inputs_prepr['mo_sin_rcnt_rev_tl_op'] <= 20), 1, 0)
df_inputs_prepr['mo_sin_rcnt_rev_tl_op:20-37'] = np.where((df_inputs_prepr['mo_sin_rcnt_rev_tl_op'] > 20) & (df_inputs_prepr['mo_sin_rcnt_rev_tl_op'] <= 37), 1, 0)
df_inputs_prepr['mo_sin_rcnt_rev_tl_op:37-63'] = np.where((df_inputs_prepr['mo_sin_rcnt_rev_tl_op'] > 37) & (df_inputs_prepr['mo_sin_rcnt_rev_tl_op'] <= 63), 1, 0)
df_inputs_prepr['mo_sin_rcnt_rev_tl_op:63-80'] = np.where((df_inputs_prepr['mo_sin_rcnt_rev_tl_op'] > 63) & (df_inputs_prepr['mo_sin_rcnt_rev_tl_op'] <= 80), 1, 0)
df_inputs_prepr['mo_sin_rcnt_rev_tl_op:80-140'] = np.where((df_inputs_prepr['mo_sin_rcnt_rev_tl_op'] > 80) & (df_inputs_prepr['mo_sin_rcnt_rev_tl_op'] <= 140), 1, 0)
df_inputs_prepr['mo_sin_rcnt_rev_tl_op:>140'] = np.where((df_inputs_prepr['mo_sin_rcnt_rev_tl_op'] > 140), 1, 0)
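Because the dummies above are meant to partition the variable's range, a quick sanity check (our suggestion, not in the original notebook) is that every observation activates exactly one dummy:

```python
import numpy as np
import pandas as pd

# Hypothetical mini-frame standing in for df_inputs_prepr.
df = pd.DataFrame({'mo_sin_rcnt_rev_tl_op': [0, 3, 4, 15, 100, 200]})
bins = [(-np.inf, 3), (3, 6), (6, 9), (9, 20), (20, 37),
        (37, 63), (63, 80), (80, 140), (140, np.inf)]
labels = ['0-3', '3-6', '6-9', '9-20', '20-37', '37-63', '63-80', '80-140', '>140']
dummies = pd.DataFrame({
    f'mo_sin_rcnt_rev_tl_op:{lab}': ((df['mo_sin_rcnt_rev_tl_op'] > lo)
                                     & (df['mo_sin_rcnt_rev_tl_op'] <= hi)).astype(int)
    for (lo, hi), lab in zip(bins, labels)
})
# Each row should belong to exactly one bin.
assert (dummies.sum(axis=1) == 1).all()
```

The same check applies to every set of coarse-classed dummies in this section.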

Variable: 'mo_sin_rcnt_tl'¶

In [1426]:
df_inputs_prepr['mo_sin_rcnt_tl'].nunique()
Out[1426]:
153
In [1427]:
# 'mo_sin_rcnt_tl'
# We keep the observations with 'mo_sin_rcnt_tl' less than or equal to 50.
# We take a copy so that adding a column does not trigger a SettingWithCopyWarning.
df_inputs_prepr_temp = df_inputs_prepr.loc[df_inputs_prepr['mo_sin_rcnt_tl'] <= 50, : ].copy()

# Here we do fine-classing: using the 'cut' method, we split the variable into 50 categories by its values.
df_inputs_prepr_temp['mo_sin_rcnt_tl_factor'] = pd.cut(df_inputs_prepr_temp['mo_sin_rcnt_tl'], 50)

df_temp = woe_ordered_continuous(df_inputs_prepr_temp, 'mo_sin_rcnt_tl_factor', df_targets_prepr[df_inputs_prepr_temp.index])
# We calculate weight of evidence.
df_temp
Out[1427]:
mo_sin_rcnt_tl_factor n_obs prop_good prop_n_obs n_good n_bad prop_n_good prop_n_bad WoE diff_prop_good diff_WoE IV
0 (-0.05, 1.0] 27045 0.263154 0.099212 7117.0 19928.0 0.121571 0.093097 0.835452 NaN NaN 0.014119
1 (1.0, 2.0] 29629 0.240845 0.108691 7136.0 22493.0 0.121895 0.105079 0.770122 0.022309 0.065330 0.014119
2 (2.0, 3.0] 28294 0.234997 0.103793 6649.0 21645.0 0.113577 0.101118 0.752929 0.005848 0.017194 0.014119
3 (3.0, 4.0] 24875 0.232523 0.091251 5784.0 19091.0 0.098801 0.089187 0.745645 0.002474 0.007284 0.014119
4 (4.0, 5.0] 21449 0.218985 0.078683 4697.0 16752.0 0.080233 0.078260 0.705677 0.013538 0.039968 0.014119
5 (5.0, 6.0] 32034 0.192296 0.117513 6160.0 25874.0 0.105224 0.120874 0.626217 0.026689 0.079460 0.014119
6 (6.0, 7.0] 16581 0.216875 0.060826 3596.0 12985.0 0.061426 0.060661 0.699429 0.024579 0.073213 0.014119
7 (7.0, 8.0] 13901 0.206028 0.050994 2864.0 11037.0 0.048922 0.051561 0.667224 0.010846 0.032205 0.014119
8 (8.0, 9.0] 11458 0.200820 0.042032 2301.0 9157.0 0.039305 0.042778 0.651705 0.005208 0.015519 0.014119
9 (9.0, 10.0] 9786 0.199980 0.035899 1957.0 7829.0 0.033429 0.036574 0.649196 0.000841 0.002509 0.014119
10 (10.0, 11.0] 8448 0.194010 0.030991 1639.0 6809.0 0.027997 0.031809 0.631352 0.005969 0.017843 0.014119
11 (11.0, 12.0] 7155 0.195667 0.026247 1400.0 5755.0 0.023914 0.026885 0.636311 0.001657 0.004958 0.014119
12 (12.0, 13.0] 6205 0.191942 0.022762 1191.0 5014.0 0.020344 0.023424 0.625157 0.003725 0.011154 0.014119
13 (13.0, 14.0] 5106 0.189581 0.018731 968.0 4138.0 0.016535 0.019331 0.618076 0.002361 0.007080 0.014119
14 (14.0, 15.0] 3988 0.181294 0.014630 723.0 3265.0 0.012350 0.015253 0.593154 0.008287 0.024923 0.014119
15 (15.0, 16.0] 3402 0.165197 0.012480 562.0 2840.0 0.009600 0.013267 0.544397 0.016097 0.048757 0.014119
16 (16.0, 17.0] 2879 0.176103 0.010561 507.0 2372.0 0.008660 0.011081 0.577482 0.010906 0.033085 0.014119
17 (17.0, 18.0] 2458 0.155004 0.009017 381.0 2077.0 0.006508 0.009703 0.513263 0.021099 0.064219 0.014119
18 (18.0, 19.0] 2193 0.169175 0.008045 371.0 1822.0 0.006337 0.008512 0.556490 0.014171 0.043227 0.014119
19 (19.0, 20.0] 1788 0.154922 0.006559 277.0 1511.0 0.004732 0.007059 0.513011 0.014253 0.043480 0.014119
20 (20.0, 21.0] 1601 0.185509 0.005873 297.0 1304.0 0.005073 0.006092 0.605845 0.030587 0.092834 0.014119
21 (21.0, 22.0] 1437 0.176061 0.005271 253.0 1184.0 0.004322 0.005531 0.577356 0.009448 0.028488 0.014119
22 (22.0, 23.0] 1237 0.178658 0.004538 221.0 1016.0 0.003775 0.004746 0.585202 0.002597 0.007846 0.014119
23 (23.0, 24.0] 1164 0.152062 0.004270 177.0 987.0 0.003023 0.004611 0.504236 0.026596 0.080967 0.014119
24 (24.0, 25.0] 907 0.158765 0.003327 144.0 763.0 0.002460 0.003564 0.524776 0.006703 0.020541 0.014119
25 (25.0, 26.0] 840 0.167857 0.003081 141.0 699.0 0.002409 0.003265 0.552488 0.009092 0.027712 0.014119
26 (26.0, 27.0] 763 0.140236 0.002799 107.0 656.0 0.001828 0.003065 0.467755 0.027621 0.084733 0.014119
27 (27.0, 28.0] 658 0.165653 0.002414 109.0 549.0 0.001862 0.002565 0.545787 0.025418 0.078032 0.014119
28 (28.0, 29.0] 521 0.145873 0.001911 76.0 445.0 0.001298 0.002079 0.485185 0.019780 0.060602 0.014119
29 (29.0, 30.0] 475 0.162105 0.001742 77.0 398.0 0.001315 0.001859 0.534976 0.016232 0.049791 0.014119
30 (30.0, 31.0] 477 0.161426 0.001750 77.0 400.0 0.001315 0.001869 0.532902 0.000680 0.002074 0.014119
31 (31.0, 32.0] 440 0.150000 0.001614 66.0 374.0 0.001127 0.001747 0.497898 0.011426 0.035004 0.014119
32 (32.0, 33.0] 388 0.164948 0.001423 64.0 324.0 0.001093 0.001514 0.543641 0.014948 0.045743 0.014119
33 (33.0, 34.0] 364 0.192308 0.001335 70.0 294.0 0.001196 0.001373 0.626253 0.027359 0.082612 0.014119
34 (34.0, 35.0] 294 0.108844 0.001079 32.0 262.0 0.000547 0.001224 0.369210 0.083464 0.257043 0.014119
35 (35.0, 36.0] 275 0.156364 0.001009 43.0 232.0 0.000735 0.001084 0.517428 0.047520 0.148218 0.014119
36 (36.0, 37.0] 258 0.124031 0.000946 32.0 226.0 0.000547 0.001056 0.417216 0.032333 0.100212 0.014119
37 (37.0, 38.0] 219 0.136986 0.000803 30.0 189.0 0.000512 0.000883 0.457673 0.012955 0.040457 0.014119
38 (38.0, 39.0] 187 0.160428 0.000686 30.0 157.0 0.000512 0.000733 0.529856 0.023442 0.072184 0.014119
39 (39.0, 40.0] 182 0.159341 0.000668 29.0 153.0 0.000495 0.000715 0.526535 0.001087 0.003321 0.014119
40 (40.0, 41.0] 159 0.100629 0.000583 16.0 143.0 0.000273 0.000668 0.342962 0.058712 0.183573 0.014119
41 (41.0, 42.0] 162 0.172840 0.000594 28.0 134.0 0.000478 0.000626 0.567606 0.072211 0.224644 0.014119
42 (42.0, 43.0] 152 0.125000 0.000558 19.0 133.0 0.000325 0.000621 0.420257 0.047840 0.147349 0.014119
43 (43.0, 44.0] 128 0.140625 0.000470 18.0 110.0 0.000307 0.000514 0.468960 0.015625 0.048703 0.014119
44 (44.0, 45.0] 116 0.112069 0.000426 13.0 103.0 0.000222 0.000481 0.379461 0.028556 0.089500 0.014119
45 (45.0, 46.0] 121 0.206612 0.000444 25.0 96.0 0.000427 0.000448 0.668960 0.094543 0.289499 0.014119
46 (46.0, 47.0] 113 0.168142 0.000415 19.0 94.0 0.000325 0.000439 0.553352 0.038470 0.115607 0.014119
47 (47.0, 48.0] 88 0.136364 0.000323 12.0 76.0 0.000205 0.000355 0.455738 0.031778 0.097614 0.014119
48 (48.0, 49.0] 93 0.182796 0.000341 17.0 76.0 0.000290 0.000355 0.597679 0.046432 0.141941 0.014119
49 (49.0, 50.0] 106 0.188679 0.000389 20.0 86.0 0.000342 0.000402 0.615370 0.005884 0.017691 0.014119
In [1428]:
plot_by_woe(df_temp.iloc[ 8: , : ], 90)
#plot_by_woe(df_temp, 90)
# We plot the weight of evidence values.
[Plot: weight of evidence by 'mo_sin_rcnt_tl' category]
In [1430]:
# We create the following categories: '0-2', '2-5', '5-6', '6-10', '10-15', '15-20', '20-50', '> 50'.
# '> 50' will be the reference category
df_inputs_prepr['mo_sin_rcnt_tl:0-2'] = np.where((df_inputs_prepr['mo_sin_rcnt_tl'] <= 2), 1, 0)
df_inputs_prepr['mo_sin_rcnt_tl:2-5'] = np.where((df_inputs_prepr['mo_sin_rcnt_tl'] > 2) & (df_inputs_prepr['mo_sin_rcnt_tl'] <= 5), 1, 0)
df_inputs_prepr['mo_sin_rcnt_tl:5-6'] = np.where((df_inputs_prepr['mo_sin_rcnt_tl'] > 5) & (df_inputs_prepr['mo_sin_rcnt_tl'] <= 6), 1, 0)
df_inputs_prepr['mo_sin_rcnt_tl:6-10'] = np.where((df_inputs_prepr['mo_sin_rcnt_tl'] > 6) & (df_inputs_prepr['mo_sin_rcnt_tl'] <= 10), 1, 0)
df_inputs_prepr['mo_sin_rcnt_tl:10-15'] = np.where((df_inputs_prepr['mo_sin_rcnt_tl'] > 10) & (df_inputs_prepr['mo_sin_rcnt_tl'] <= 15), 1, 0)
df_inputs_prepr['mo_sin_rcnt_tl:15-20'] = np.where((df_inputs_prepr['mo_sin_rcnt_tl'] > 15) & (df_inputs_prepr['mo_sin_rcnt_tl'] <= 20), 1, 0)
df_inputs_prepr['mo_sin_rcnt_tl:20-50'] = np.where((df_inputs_prepr['mo_sin_rcnt_tl'] > 20) & (df_inputs_prepr['mo_sin_rcnt_tl'] <= 50), 1, 0)
df_inputs_prepr['mo_sin_rcnt_tl:>50'] = np.where((df_inputs_prepr['mo_sin_rcnt_tl'] > 50), 1, 0)
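The repeated np.where calls in cells like the one above can be factored into a small helper that builds one dummy per (low, high] interval from a list of edges. A minimal sketch (the helper name and demo values are illustrative, not part of the notebook):

```python
import numpy as np
import pandas as pd

def make_interval_dummies(df, col, edges):
    """Add one 0/1 dummy per (low, high] interval, plus an open-ended '>last' bin."""
    # First bin is closed on the left: values <= edges[0].
    df[f'{col}:0-{edges[0]}'] = np.where(df[col] <= edges[0], 1, 0)
    for low, high in zip(edges[:-1], edges[1:]):
        df[f'{col}:{low}-{high}'] = np.where((df[col] > low) & (df[col] <= high), 1, 0)
    # Open-ended top bin.
    df[f'{col}:>{edges[-1]}'] = np.where(df[col] > edges[-1], 1, 0)
    return df

# Usage: reproduces the 'mo_sin_rcnt_tl' dummy scheme above on toy data.
demo = pd.DataFrame({'mo_sin_rcnt_tl': [1, 4, 8, 30, 75]})
demo = make_interval_dummies(demo, 'mo_sin_rcnt_tl', [2, 5, 6, 10, 15, 20, 50])
```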

Variable: 'mths_since_rcnt_il'¶

In [1431]:
df_inputs_prepr['mths_since_rcnt_il'].nunique()
Out[1431]:
271
In [1432]:
df_inputs_prepr.loc[df_inputs_prepr['mths_since_rcnt_il'] >= 999., : ]['mths_since_rcnt_il'].count()
Out[1432]:
164353

There are 164353 missing values filled with '999'.

In [1433]:
# 'mths_since_rcnt_il'
# We keep observations with 'mths_since_rcnt_il' less than or equal to 100.
df_inputs_prepr_temp = df_inputs_prepr.loc[df_inputs_prepr['mths_since_rcnt_il'] <= 100, : ]

# Here we do fine-classing: using the 'cut' method, we split the variable into 50 categories by its values.
# df_inputs_prepr_temp
df_inputs_prepr_temp['mths_since_rcnt_il_factor'] = pd.cut(df_inputs_prepr_temp['mths_since_rcnt_il'], 50)

df_temp = woe_ordered_continuous(df_inputs_prepr_temp, 'mths_since_rcnt_il_factor', df_targets_prepr[df_inputs_prepr_temp.index])
# We calculate weight of evidence.
df_temp
C:\Users\pc\AppData\Local\Temp\ipykernel_6728\3212778370.py:7: PerformanceWarning: DataFrame is highly fragmented.  This is usually the result of calling `frame.insert` many times, which has poor performance.  Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
  df_inputs_prepr_temp['mths_since_rcnt_il_factor'] = pd.cut(df_inputs_prepr_temp['mths_since_rcnt_il'], 50)
C:\Users\pc\AppData\Local\Temp\ipykernel_6728\3212778370.py:7: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_inputs_prepr_temp['mths_since_rcnt_il_factor'] = pd.cut(df_inputs_prepr_temp['mths_since_rcnt_il'], 50)
C:\Users\pc\AppData\Local\Temp\ipykernel_6728\1025164655.py:4: FutureWarning: The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning.
  df = pd.concat([df.groupby(df.columns.values[0], as_index = False)[df.columns.values[1]].count(),
C:\Users\pc\AppData\Local\Temp\ipykernel_6728\1025164655.py:5: FutureWarning: The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning.
  df.groupby(df.columns.values[0], as_index = False)[df.columns.values[1]].mean()], axis = 1)
Out[1433]:
mths_since_rcnt_il_factor n_obs prop_good prop_n_obs n_good n_bad prop_n_good prop_n_bad WoE diff_prop_good diff_WoE IV
0 (-0.1, 2.0] 7109 0.293572 0.066417 2087.0 5022.0 0.076669 0.062921 0.796832 NaN NaN 0.005064
1 (2.0, 4.0] 11130 0.277987 0.103984 3094.0 8036.0 0.113662 0.100683 0.755612 0.015584 0.041220 0.005064
2 (4.0, 6.0] 10576 0.265696 0.098808 2810.0 7766.0 0.103229 0.097300 0.723160 0.012292 0.032451 0.005064
3 (6.0, 8.0] 10703 0.268429 0.099994 2873.0 7830.0 0.105544 0.098102 0.730374 0.002733 0.007213 0.005064
4 (8.0, 10.0] 9182 0.258332 0.085784 2372.0 6810.0 0.087139 0.085322 0.703735 0.010098 0.026639 0.005064
5 (10.0, 12.0] 8133 0.249477 0.075984 2029.0 6104.0 0.074538 0.076477 0.680390 0.008854 0.023344 0.005064
6 (12.0, 14.0] 7957 0.244565 0.074339 1946.0 6011.0 0.071489 0.075312 0.667440 0.004913 0.012950 0.005064
7 (14.0, 16.0] 6037 0.235547 0.056402 1422.0 4615.0 0.052239 0.057821 0.643673 0.009017 0.023768 0.005064
8 (16.0, 18.0] 5003 0.244453 0.046741 1223.0 3780.0 0.044929 0.047360 0.667147 0.008906 0.023474 0.005064
9 (18.0, 20.0] 4167 0.236621 0.038931 986.0 3181.0 0.036222 0.039855 0.646503 0.007832 0.020644 0.005064
10 (20.0, 22.0] 3632 0.238987 0.033933 868.0 2764.0 0.031887 0.034630 0.652738 0.002366 0.006236 0.005064
11 (22.0, 24.0] 2937 0.236636 0.027439 695.0 2242.0 0.025532 0.028090 0.646542 0.002351 0.006196 0.005064
12 (24.0, 26.0] 2438 0.249795 0.022777 609.0 1829.0 0.022372 0.022915 0.681227 0.013159 0.034685 0.005064
13 (26.0, 28.0] 1965 0.240712 0.018358 473.0 1492.0 0.017376 0.018693 0.657287 0.009082 0.023940 0.005064
14 (28.0, 30.0] 1734 0.233564 0.016200 405.0 1329.0 0.014878 0.016651 0.638444 0.007148 0.018843 0.005064
15 (30.0, 32.0] 1541 0.229721 0.014397 354.0 1187.0 0.013005 0.014872 0.628313 0.003843 0.010131 0.005064
16 (32.0, 34.0] 1349 0.235730 0.012603 318.0 1031.0 0.011682 0.012917 0.644154 0.006009 0.015841 0.005064
17 (34.0, 36.0] 1161 0.235142 0.010847 273.0 888.0 0.010029 0.011126 0.642604 0.000588 0.001550 0.005064
18 (36.0, 38.0] 919 0.250272 0.008586 230.0 689.0 0.008449 0.008632 0.682485 0.015130 0.039881 0.005064
19 (38.0, 40.0] 812 0.242611 0.007586 197.0 615.0 0.007237 0.007705 0.662291 0.007661 0.020194 0.005064
20 (40.0, 42.0] 751 0.222370 0.007016 167.0 584.0 0.006135 0.007317 0.608930 0.020241 0.053360 0.005064
21 (42.0, 44.0] 689 0.217707 0.006437 150.0 539.0 0.005510 0.006753 0.596629 0.004663 0.012301 0.005064
22 (44.0, 46.0] 576 0.218750 0.005381 126.0 450.0 0.004629 0.005638 0.599381 0.001043 0.002752 0.005064
23 (46.0, 48.0] 540 0.242593 0.005045 131.0 409.0 0.004812 0.005124 0.662242 0.023843 0.062862 0.005064
24 (48.0, 50.0] 494 0.238866 0.004615 118.0 376.0 0.004335 0.004711 0.652421 0.003726 0.009822 0.005064
25 (50.0, 52.0] 417 0.211031 0.003896 88.0 329.0 0.003233 0.004122 0.579011 0.027835 0.073410 0.005064
26 (52.0, 54.0] 428 0.224299 0.003999 96.0 332.0 0.003527 0.004160 0.614017 0.013268 0.035006 0.005064
27 (54.0, 56.0] 351 0.210826 0.003279 74.0 277.0 0.002718 0.003471 0.578470 0.013473 0.035547 0.005064
28 (56.0, 58.0] 334 0.251497 0.003120 84.0 250.0 0.003086 0.003132 0.685714 0.040671 0.107244 0.005064
29 (58.0, 60.0] 318 0.223270 0.002971 71.0 247.0 0.002608 0.003095 0.611304 0.028227 0.074410 0.005064
30 (60.0, 62.0] 311 0.254019 0.002906 79.0 232.0 0.002902 0.002907 0.692364 0.030749 0.081060 0.005064
31 (62.0, 64.0] 238 0.218487 0.002224 52.0 186.0 0.001910 0.002330 0.598688 0.035532 0.093676 0.005064
32 (64.0, 66.0] 292 0.222603 0.002728 65.0 227.0 0.002388 0.002844 0.609543 0.004115 0.010855 0.005064
33 (66.0, 68.0] 258 0.267442 0.002410 69.0 189.0 0.002535 0.002368 0.727768 0.044839 0.118224 0.005064
34 (68.0, 70.0] 214 0.275701 0.001999 59.0 155.0 0.002167 0.001942 0.749572 0.008259 0.021804 0.005064
35 (70.0, 72.0] 194 0.242268 0.001812 47.0 147.0 0.001727 0.001842 0.661387 0.033433 0.088185 0.005064
36 (72.0, 74.0] 186 0.209677 0.001738 39.0 147.0 0.001433 0.001842 0.575437 0.032591 0.085950 0.005064
37 (74.0, 76.0] 172 0.244186 0.001607 42.0 130.0 0.001543 0.001629 0.666443 0.034509 0.091006 0.005064
38 (76.0, 78.0] 185 0.205405 0.001728 38.0 147.0 0.001396 0.001842 0.564154 0.038781 0.102288 0.005064
39 (78.0, 80.0] 154 0.201299 0.001439 31.0 123.0 0.001139 0.001541 0.553303 0.004107 0.010851 0.005064
40 (80.0, 82.0] 174 0.235632 0.001626 41.0 133.0 0.001506 0.001666 0.643896 0.034333 0.090593 0.005064
41 (82.0, 84.0] 147 0.258503 0.001373 38.0 109.0 0.001396 0.001366 0.704188 0.022871 0.060292 0.005064
42 (84.0, 86.0] 127 0.236220 0.001187 30.0 97.0 0.001102 0.001215 0.645447 0.022283 0.058741 0.005064
43 (86.0, 88.0] 141 0.241135 0.001317 34.0 107.0 0.001249 0.001341 0.658400 0.004914 0.012953 0.005064
44 (88.0, 90.0] 128 0.210938 0.001196 27.0 101.0 0.000992 0.001265 0.578764 0.030197 0.079636 0.005064
45 (90.0, 92.0] 152 0.236842 0.001420 36.0 116.0 0.001323 0.001453 0.647085 0.025905 0.068322 0.005064
46 (92.0, 94.0] 122 0.204918 0.001140 25.0 97.0 0.000918 0.001215 0.562867 0.031924 0.084218 0.005064
47 (94.0, 96.0] 160 0.200000 0.001495 32.0 128.0 0.001176 0.001604 0.549870 0.004918 0.012997 0.005064
48 (96.0, 98.0] 163 0.251534 0.001523 41.0 122.0 0.001506 0.001529 0.685811 0.051534 0.135941 0.005064
49 (98.0, 100.0] 135 0.200000 0.001261 27.0 108.0 0.000992 0.001353 0.549870 0.051534 0.135941 0.005064
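As a reference point, the textbook definitions of weight of evidence and information value can be sketched as follows. This is the standard formulation, given for orientation only; the notebook's woe_ordered_continuous helper builds its table with its own conventions and extra diagnostic columns (prop_good, diff_WoE, etc.):

```python
import numpy as np
import pandas as pd

def woe_iv(bins, target):
    """Standard WoE/IV table for a binned feature against a binary target (1 = good, 0 = bad)."""
    df = pd.DataFrame({'bin': bins, 'good': target})
    grp = df.groupby('bin', observed=True)['good'].agg(n_obs='count', n_good='sum')
    grp['n_bad'] = grp['n_obs'] - grp['n_good']
    # Each bin's share of all goods and of all bads.
    grp['prop_n_good'] = grp['n_good'] / grp['n_good'].sum()
    grp['prop_n_bad'] = grp['n_bad'] / grp['n_bad'].sum()
    # WoE per bin, and IV summed over all bins (same value repeated per row, as in the tables above).
    grp['WoE'] = np.log(grp['prop_n_good'] / grp['prop_n_bad'])
    grp['IV'] = ((grp['prop_n_good'] - grp['prop_n_bad']) * grp['WoE']).sum()
    return grp.reset_index()

# Toy example: bin 'A' is mostly good, bin 'B' mostly bad.
woe_table = woe_iv(['A'] * 4 + ['B'] * 4, [1, 1, 1, 0, 1, 0, 0, 0])
```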
In [1434]:
plot_by_woe(df_temp.iloc[5 : 50, : ], 90)
#plot_by_woe(df_temp, 90)
# We plot the weight of evidence values.
[Figure: weight of evidence per 'mths_since_rcnt_il' fine class]
In [1436]:
# We create the following categories: 'Missing', '0-4', '4-10', '10-20', '20-40', '40-100', '> 100'.
# 'Missing' will be the reference category.
df_inputs_prepr['mths_since_rcnt_il:0-4'] = np.where((df_inputs_prepr['mths_since_rcnt_il'] <= 4), 1, 0)
df_inputs_prepr['mths_since_rcnt_il:4-10'] = np.where((df_inputs_prepr['mths_since_rcnt_il'] > 4) & (df_inputs_prepr['mths_since_rcnt_il'] <= 10), 1, 0)
df_inputs_prepr['mths_since_rcnt_il:10-20'] = np.where((df_inputs_prepr['mths_since_rcnt_il'] > 10) & (df_inputs_prepr['mths_since_rcnt_il'] <= 20), 1, 0)
df_inputs_prepr['mths_since_rcnt_il:20-40'] = np.where((df_inputs_prepr['mths_since_rcnt_il'] > 20) & (df_inputs_prepr['mths_since_rcnt_il'] <= 40), 1, 0)
df_inputs_prepr['mths_since_rcnt_il:40-100'] = np.where((df_inputs_prepr['mths_since_rcnt_il'] > 40) & (df_inputs_prepr['mths_since_rcnt_il'] <= 100), 1, 0)
df_inputs_prepr['mths_since_rcnt_il:>100'] = np.where((df_inputs_prepr['mths_since_rcnt_il'] > 100) & (df_inputs_prepr['mths_since_rcnt_il'] <= 700), 1, 0)
df_inputs_prepr['mths_since_rcnt_il:Missing'] = np.where((df_inputs_prepr['mths_since_rcnt_il'] == 999), 1, 0)

Variable: 'mths_since_recent_bc'¶

In [1437]:
df_inputs_prepr['mths_since_recent_bc'].nunique()
Out[1437]:
388
In [1438]:
# 'mths_since_recent_bc'
# We keep observations with 'mths_since_recent_bc' less than or equal to 200.
df_inputs_prepr_temp = df_inputs_prepr.loc[df_inputs_prepr['mths_since_recent_bc'] <= 200, : ]

# Here we do fine-classing: using the 'cut' method, we split the variable into 50 categories by its values.
# df_inputs_prepr_temp
df_inputs_prepr_temp['mths_since_recent_bc_factor'] = pd.cut(df_inputs_prepr_temp['mths_since_recent_bc'], 50)

df_temp = woe_ordered_continuous(df_inputs_prepr_temp, 'mths_since_recent_bc_factor', df_targets_prepr[df_inputs_prepr_temp.index])
# We calculate weight of evidence.
df_temp
Out[1438]:
mths_since_recent_bc_factor n_obs prop_good prop_n_obs n_good n_bad prop_n_good prop_n_bad WoE diff_prop_good diff_WoE IV
0 (-0.2, 4.0] 49141 0.250300 0.179673 12300.0 36841.0 0.209647 0.171487 0.798642 NaN NaN 0.012995
1 (4.0, 8.0] 42666 0.231425 0.155999 9874.0 32792.0 0.168297 0.152640 0.743163 0.018875 0.055480 0.012995
2 (8.0, 12.0] 33383 0.229967 0.122058 7677.0 25706.0 0.130851 0.119656 0.738863 0.001458 0.004300 0.012995
3 (12.0, 16.0] 38503 0.207542 0.140778 7991.0 30512.0 0.136202 0.142027 0.672428 0.022425 0.066435 0.012995
4 (16.0, 20.0] 19111 0.213594 0.069875 4082.0 15029.0 0.069576 0.069957 0.690418 0.006052 0.017989 0.012995
5 (20.0, 24.0] 15002 0.203906 0.054852 3059.0 11943.0 0.052139 0.055592 0.661596 0.009688 0.028821 0.012995
6 (24.0, 28.0] 11501 0.203113 0.042051 2336.0 9165.0 0.039816 0.042661 0.659231 0.000793 0.002366 0.012995
7 (28.0, 32.0] 8800 0.198068 0.032175 1743.0 7057.0 0.029709 0.032849 0.644167 0.005045 0.015064 0.012995
8 (32.0, 36.0] 7308 0.193487 0.026720 1414.0 5894.0 0.024101 0.027435 0.630452 0.004582 0.013714 0.012995
9 (36.0, 40.0] 5947 0.183454 0.021744 1091.0 4856.0 0.018596 0.022604 0.600306 0.010033 0.030147 0.012995
10 (40.0, 44.0] 4767 0.189218 0.017429 902.0 3865.0 0.015374 0.017991 0.617645 0.005764 0.017339 0.012995
11 (44.0, 48.0] 4078 0.172388 0.014910 703.0 3375.0 0.011982 0.015710 0.566857 0.016829 0.050787 0.012995
12 (48.0, 52.0] 3450 0.178551 0.012614 616.0 2834.0 0.010499 0.013192 0.585512 0.006162 0.018654 0.012995
13 (52.0, 56.0] 2966 0.196898 0.010845 584.0 2382.0 0.009954 0.011088 0.640667 0.018347 0.055156 0.012995
14 (56.0, 60.0] 2583 0.160279 0.009444 414.0 2169.0 0.007056 0.010096 0.529989 0.036619 0.110678 0.012995
15 (60.0, 64.0] 2329 0.165307 0.008515 385.0 1944.0 0.006562 0.009049 0.545333 0.005028 0.015344 0.012995
16 (64.0, 68.0] 2087 0.178246 0.007631 372.0 1715.0 0.006341 0.007983 0.584592 0.012939 0.039259 0.012995
17 (68.0, 72.0] 2055 0.152311 0.007514 313.0 1742.0 0.005335 0.008109 0.505569 0.025935 0.079022 0.012995
18 (72.0, 76.0] 1817 0.162356 0.006643 295.0 1522.0 0.005028 0.007085 0.536333 0.010044 0.030763 0.012995
19 (76.0, 80.0] 1640 0.160976 0.005996 264.0 1376.0 0.004500 0.006405 0.532119 0.001380 0.004214 0.012995
20 (80.0, 84.0] 1477 0.149628 0.005400 221.0 1256.0 0.003767 0.005846 0.497312 0.011348 0.034806 0.012995
21 (84.0, 88.0] 1471 0.148878 0.005378 219.0 1252.0 0.003733 0.005828 0.495004 0.000749 0.002308 0.012995
22 (88.0, 92.0] 1358 0.153903 0.004965 209.0 1149.0 0.003562 0.005348 0.510458 0.005024 0.015453 0.012995
23 (92.0, 96.0] 1129 0.169176 0.004128 191.0 938.0 0.003255 0.004366 0.557106 0.015273 0.046648 0.012995
24 (96.0, 100.0] 1045 0.143541 0.003821 150.0 895.0 0.002557 0.004166 0.478525 0.025636 0.078580 0.012995
25 (100.0, 104.0] 971 0.144181 0.003550 140.0 831.0 0.002386 0.003868 0.480506 0.000641 0.001981 0.012995
26 (104.0, 108.0] 891 0.173962 0.003258 155.0 736.0 0.002642 0.003426 0.571627 0.029781 0.091120 0.012995
27 (108.0, 112.0] 724 0.168508 0.002647 122.0 602.0 0.002079 0.002802 0.555075 0.005454 0.016552 0.012995
28 (112.0, 116.0] 688 0.164244 0.002516 113.0 575.0 0.001926 0.002677 0.542094 0.004264 0.012981 0.012995
29 (116.0, 120.0] 592 0.148649 0.002165 88.0 504.0 0.001500 0.002346 0.494297 0.015596 0.047797 0.012995
30 (120.0, 124.0] 503 0.143141 0.001839 72.0 431.0 0.001227 0.002006 0.477289 0.005507 0.017007 0.012995
31 (124.0, 128.0] 437 0.162471 0.001598 71.0 366.0 0.001210 0.001704 0.536686 0.019330 0.059397 0.012995
32 (128.0, 132.0] 368 0.157609 0.001346 58.0 310.0 0.000989 0.001443 0.521820 0.004863 0.014866 0.012995
33 (132.0, 136.0] 362 0.162983 0.001324 59.0 303.0 0.001006 0.001410 0.538249 0.005375 0.016428 0.012995
34 (136.0, 140.0] 276 0.163043 0.001009 45.0 231.0 0.000767 0.001075 0.538432 0.000060 0.000183 0.012995
35 (140.0, 144.0] 246 0.138211 0.000899 34.0 212.0 0.000580 0.000987 0.462005 0.024832 0.076427 0.012995
36 (144.0, 148.0] 233 0.145923 0.000852 34.0 199.0 0.000580 0.000926 0.485888 0.007711 0.023882 0.012995
37 (148.0, 152.0] 211 0.194313 0.000771 41.0 170.0 0.000699 0.000791 0.632928 0.048390 0.147040 0.012995
38 (152.0, 156.0] 190 0.221053 0.000695 42.0 148.0 0.000716 0.000689 0.712524 0.026740 0.079596 0.012995
39 (156.0, 160.0] 164 0.146341 0.000600 24.0 140.0 0.000409 0.000652 0.487180 0.074711 0.225344 0.012995
40 (160.0, 164.0] 167 0.215569 0.000611 36.0 131.0 0.000614 0.000610 0.696277 0.069227 0.209096 0.012995
41 (164.0, 168.0] 146 0.123288 0.000534 18.0 128.0 0.000307 0.000596 0.415367 0.092281 0.280910 0.012995
42 (168.0, 172.0] 130 0.192308 0.000475 25.0 105.0 0.000426 0.000489 0.626918 0.069020 0.211551 0.012995
43 (172.0, 176.0] 112 0.125000 0.000410 14.0 98.0 0.000239 0.000456 0.420748 0.067308 0.206171 0.012995
44 (176.0, 180.0] 110 0.145455 0.000402 16.0 94.0 0.000273 0.000438 0.484442 0.020455 0.063694 0.012995
45 (180.0, 184.0] 93 0.129032 0.000340 12.0 81.0 0.000205 0.000377 0.433388 0.016422 0.051054 0.012995
46 (184.0, 188.0] 67 0.164179 0.000245 11.0 56.0 0.000187 0.000261 0.541896 0.035147 0.108508 0.012995
47 (188.0, 192.0] 76 0.210526 0.000278 16.0 60.0 0.000273 0.000279 0.681304 0.046347 0.139409 0.012995
48 (192.0, 196.0] 76 0.092105 0.000278 7.0 69.0 0.000119 0.000321 0.315888 0.118421 0.365416 0.012995
49 (196.0, 200.0] 55 0.218182 0.000201 12.0 43.0 0.000205 0.000200 0.704023 0.126077 0.388135 0.012995
In [1439]:
plot_by_woe(df_temp.iloc[10 : , : ], 90)
#plot_by_woe(df_temp, 90)
# We plot the weight of evidence values.
[Figure: weight of evidence per 'mths_since_recent_bc' fine class]
In [1441]:
# We create the following categories: '0-12', '12-32', '32-52', '52-68', '68-100', '100-130', '> 130'.
# '> 130' will be the reference category.
df_inputs_prepr['mths_since_recent_bc:0-12'] = np.where((df_inputs_prepr['mths_since_recent_bc'] <= 12), 1, 0)
df_inputs_prepr['mths_since_recent_bc:12-32'] = np.where((df_inputs_prepr['mths_since_recent_bc'] > 12) & (df_inputs_prepr['mths_since_recent_bc'] <= 32), 1, 0)
df_inputs_prepr['mths_since_recent_bc:32-52'] = np.where((df_inputs_prepr['mths_since_recent_bc'] > 32) & (df_inputs_prepr['mths_since_recent_bc'] <= 52), 1, 0)
df_inputs_prepr['mths_since_recent_bc:52-68'] = np.where((df_inputs_prepr['mths_since_recent_bc'] > 52) & (df_inputs_prepr['mths_since_recent_bc'] <= 68), 1, 0)
df_inputs_prepr['mths_since_recent_bc:68-100'] = np.where((df_inputs_prepr['mths_since_recent_bc'] > 68) & (df_inputs_prepr['mths_since_recent_bc'] <= 100), 1, 0)
df_inputs_prepr['mths_since_recent_bc:100-130'] = np.where((df_inputs_prepr['mths_since_recent_bc'] > 100) & (df_inputs_prepr['mths_since_recent_bc'] <= 130), 1, 0)
df_inputs_prepr['mths_since_recent_bc:>130'] = np.where((df_inputs_prepr['mths_since_recent_bc'] > 130), 1, 0)
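The same coarse classes can also be produced with pd.cut over explicit edges followed by pd.get_dummies. A sketch with illustrative values; the label strings are chosen to match the column-naming scheme used above:

```python
import pandas as pd

s = pd.Series([5, 25, 45, 60, 90, 120, 150], name='mths_since_recent_bc')
# Right-closed edges: (-1, 12], (12, 32], ..., (130, inf].
bins = [-1, 12, 32, 52, 68, 100, 130, float('inf')]
labels = ['0-12', '12-32', '32-52', '52-68', '68-100', '100-130', '>130']
coarse = pd.cut(s, bins=bins, labels=labels)
# One 0/1 column per coarse class, named '<variable>:<label>'.
dummies = pd.get_dummies(coarse, prefix=s.name, prefix_sep=':').astype(int)
```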

Variable: 'mths_since_recent_revol_delinq'¶

In [1442]:
df_inputs_prepr['mths_since_recent_revol_delinq'].nunique()
Out[1442]:
139
In [1443]:
df_inputs_prepr.loc[df_inputs_prepr['mths_since_recent_revol_delinq'] >= 999., : ]['mths_since_recent_revol_delinq'].count()
Out[1443]:
182457

There are 182457 missing values filled with '999'.

In [1444]:
# 'mths_since_recent_revol_delinq'
# We keep observations with 'mths_since_recent_revol_delinq' less than or equal to 120.
df_inputs_prepr_temp = df_inputs_prepr.loc[df_inputs_prepr['mths_since_recent_revol_delinq'] <= 120, : ]

# Here we do fine-classing: using the 'cut' method, we split the variable into 50 categories by its values.
# df_inputs_prepr_temp
df_inputs_prepr_temp['mths_since_recent_revol_delinq_factor'] = pd.cut(df_inputs_prepr_temp['mths_since_recent_revol_delinq'], 50)

df_temp = woe_ordered_continuous(df_inputs_prepr_temp, 'mths_since_recent_revol_delinq_factor', df_targets_prepr[df_inputs_prepr_temp.index])
# We calculate weight of evidence.
df_temp
Out[1444]:
mths_since_recent_revol_delinq_factor n_obs prop_good prop_n_obs n_good n_bad prop_n_good prop_n_bad WoE diff_prop_good diff_WoE IV
0 (-0.12, 2.4] 1611 0.218498 0.017557 352.0 1259.0 0.017197 0.017661 0.679916 NaN NaN 0.001497
1 (2.4, 4.8] 2141 0.223260 0.023334 478.0 1663.0 0.023352 0.023328 0.693665 0.004762 0.013748 0.001497
2 (4.8, 7.2] 4412 0.241387 0.048084 1065.0 3347.0 0.052030 0.046951 0.745822 0.018127 0.052157 0.001497
3 (7.2, 9.6] 3107 0.228194 0.033862 709.0 2398.0 0.034638 0.033639 0.707888 0.013193 0.037934 0.001497
4 (9.6, 12.0] 4736 0.229941 0.051615 1089.0 3647.0 0.053202 0.051159 0.712918 0.001746 0.005029 0.001497
5 (12.0, 14.4] 3210 0.224611 0.034984 721.0 2489.0 0.035224 0.034915 0.697560 0.005330 0.015358 0.001497
6 (14.4, 16.8] 3146 0.223140 0.034287 702.0 2444.0 0.034296 0.034284 0.693319 0.001470 0.004240 0.001497
7 (16.8, 19.2] 4728 0.228003 0.051528 1078.0 3650.0 0.052665 0.051201 0.707338 0.004863 0.014018 0.001497
8 (19.2, 21.6] 2996 0.226302 0.032652 678.0 2318.0 0.033123 0.032516 0.702435 0.001702 0.004903 0.001497
9 (21.6, 24.0] 4266 0.230192 0.046493 982.0 3284.0 0.047975 0.046067 0.713641 0.003890 0.011206 0.001497
10 (24.0, 26.4] 3029 0.224166 0.033011 679.0 2350.0 0.033172 0.032965 0.696279 0.006026 0.017363 0.001497
11 (26.4, 28.8] 2947 0.227689 0.032118 671.0 2276.0 0.032781 0.031927 0.706433 0.003523 0.010154 0.001497
12 (28.8, 31.2] 4179 0.227327 0.045545 950.0 3229.0 0.046412 0.045296 0.705390 0.000362 0.001043 0.001497
13 (31.2, 33.6] 2728 0.215176 0.029731 587.0 2141.0 0.028678 0.030034 0.670313 0.012151 0.035076 0.001497
14 (33.6, 36.0] 3969 0.222978 0.043256 885.0 3084.0 0.043236 0.043262 0.692851 0.007802 0.022537 0.001497
15 (36.0, 38.4] 2587 0.222652 0.028194 576.0 2011.0 0.028140 0.028210 0.691909 0.000326 0.000942 0.001497
16 (38.4, 40.8] 2571 0.213536 0.028020 549.0 2022.0 0.026821 0.028364 0.665568 0.009116 0.026342 0.001497
17 (40.8, 43.2] 3754 0.216569 0.040913 813.0 2941.0 0.039719 0.041256 0.674342 0.003033 0.008774 0.001497
18 (43.2, 45.6] 2361 0.221093 0.025731 522.0 1839.0 0.025502 0.025797 0.687410 0.004524 0.013068 0.001497
19 (45.6, 48.0] 3641 0.213678 0.039681 778.0 2863.0 0.038009 0.040162 0.665978 0.007415 0.021432 0.001497
20 (48.0, 50.4] 1806 0.224252 0.019683 405.0 1401.0 0.019786 0.019653 0.696527 0.010575 0.030548 0.001497
21 (50.4, 52.8] 1400 0.242143 0.015258 339.0 1061.0 0.016562 0.014883 0.747991 0.017890 0.051464 0.001497
22 (52.8, 55.2] 2214 0.211834 0.024129 469.0 1745.0 0.022913 0.024479 0.660641 0.030309 0.087350 0.001497
23 (55.2, 57.6] 1582 0.201643 0.017241 319.0 1263.0 0.015585 0.017717 0.631076 0.010190 0.029565 0.001497
24 (57.6, 60.0] 2269 0.215954 0.024729 490.0 1779.0 0.023939 0.024955 0.672564 0.014311 0.041488 0.001497
25 (60.0, 62.4] 1490 0.228188 0.016239 340.0 1150.0 0.016610 0.016132 0.707869 0.012234 0.035305 0.001497
26 (62.4, 64.8] 1531 0.225996 0.016686 346.0 1185.0 0.016904 0.016623 0.701554 0.002192 0.006316 0.001497
27 (64.8, 67.2] 2426 0.220528 0.026440 535.0 1891.0 0.026137 0.026527 0.685779 0.005468 0.015775 0.001497
28 (67.2, 69.6] 1576 0.227157 0.017176 358.0 1218.0 0.017490 0.017086 0.704900 0.006630 0.019122 0.001497
29 (69.6, 72.0] 2337 0.216945 0.025470 507.0 1830.0 0.024769 0.025671 0.675428 0.010213 0.029472 0.001497
30 (72.0, 74.4] 1579 0.212160 0.017209 335.0 1244.0 0.016366 0.017451 0.661584 0.004785 0.013844 0.001497
31 (74.4, 76.8] 1577 0.224477 0.017187 354.0 1223.0 0.017294 0.017156 0.697174 0.012317 0.035589 0.001497
32 (76.8, 79.2] 1885 0.207427 0.020544 391.0 1494.0 0.019102 0.020958 0.647870 0.017050 0.049304 0.001497
33 (79.2, 81.6] 1143 0.200350 0.012457 229.0 914.0 0.011188 0.012821 0.627315 0.007077 0.020555 0.001497
34 (81.6, 84.0] 362 0.218232 0.003945 79.0 283.0 0.003859 0.003970 0.679148 0.017882 0.051834 0.001497
35 (84.0, 86.4] 85 0.223529 0.000926 19.0 66.0 0.000928 0.000926 0.694441 0.005297 0.015293 0.001497
36 (86.4, 88.8] 58 0.155172 0.000632 9.0 49.0 0.000440 0.000687 0.494499 0.068357 0.199943 0.001497
37 (88.8, 91.2] 74 0.243243 0.000806 18.0 56.0 0.000879 0.000786 0.751149 0.088071 0.256650 0.001497
38 (91.2, 93.6] 39 0.230769 0.000425 9.0 30.0 0.000440 0.000421 0.715302 0.012474 0.035847 0.001497
39 (93.6, 96.0] 46 0.195652 0.000501 9.0 37.0 0.000440 0.000519 0.613638 0.035117 0.101664 0.001497
40 (96.0, 98.4] 31 0.387097 0.000338 12.0 19.0 0.000586 0.000267 1.163022 0.191445 0.549384 0.001497
41 (98.4, 100.8] 24 0.291667 0.000262 7.0 17.0 0.000342 0.000238 0.889555 0.095430 0.273468 0.001497
42 (100.8, 103.2] 27 0.259259 0.000294 7.0 20.0 0.000342 0.000281 0.797029 0.032407 0.092526 0.001497
43 (103.2, 105.6] 13 0.153846 0.000142 2.0 11.0 0.000098 0.000154 0.490550 0.105413 0.306479 0.001497
44 (105.6, 108.0] 22 0.363636 0.000240 8.0 14.0 0.000391 0.000196 1.095308 0.209790 0.604758 0.001497
45 (108.0, 110.4] 9 0.222222 0.000098 2.0 7.0 0.000098 0.000098 0.690670 0.141414 0.404638 0.001497
46 (110.4, 112.8] 8 0.125000 0.000087 1.0 7.0 0.000049 0.000098 0.403814 0.097222 0.286856 0.001497
47 (112.8, 115.2] 11 0.272727 0.000120 3.0 8.0 0.000147 0.000112 0.835517 0.147727 0.431702 0.001497
48 (115.2, 117.6] 7 0.285714 0.000076 2.0 5.0 0.000098 0.000070 0.872578 0.012987 0.037061 0.001497
49 (117.6, 120.0] 6 0.166667 0.000065 1.0 5.0 0.000049 0.000070 0.528589 0.119048 0.343989 0.001497
In [1445]:
plot_by_woe(df_temp.iloc[ : , : ], 90)
#plot_by_woe(df_temp, 90)
# We plot the weight of evidence values.
No description has been provided for this image
In [1447]:
# We create the following categories: '0-20', '20-34', '34-50', '50-84', '>84', 'Missing'.
# '> 84' will be the reference category.
df_inputs_prepr['mths_since_recent_revol_delinq:0-20'] = np.where((df_inputs_prepr['mths_since_recent_revol_delinq'] <= 20), 1, 0)
df_inputs_prepr['mths_since_recent_revol_delinq:20-34'] = np.where((df_inputs_prepr['mths_since_recent_revol_delinq'] > 20) & (df_inputs_prepr['mths_since_recent_revol_delinq'] <= 34), 1, 0)
df_inputs_prepr['mths_since_recent_revol_delinq:34-50'] = np.where((df_inputs_prepr['mths_since_recent_revol_delinq'] > 34) & (df_inputs_prepr['mths_since_recent_revol_delinq'] <= 50), 1, 0)
df_inputs_prepr['mths_since_recent_revol_delinq:50-84'] = np.where((df_inputs_prepr['mths_since_recent_revol_delinq'] > 50) & (df_inputs_prepr['mths_since_recent_revol_delinq'] <= 84), 1, 0)
df_inputs_prepr['mths_since_recent_revol_delinq:>84'] = np.where((df_inputs_prepr['mths_since_recent_revol_delinq'] > 84) & (df_inputs_prepr['mths_since_recent_revol_delinq'] <= 800), 1, 0)
df_inputs_prepr['mths_since_recent_revol_delinq:Missing'] = np.where((df_inputs_prepr['mths_since_recent_revol_delinq'] == 999), 1, 0)
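Because the categories above are meant to partition the variable (with the 999 sentinel routed to the 'Missing' dummy rather than a numeric bin), a useful sanity check is that the dummies sum to exactly 1 per row. A self-contained sketch with illustrative values:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'mths_since_recent_revol_delinq': [10, 40, 60, 90, 999]})
col = 'mths_since_recent_revol_delinq'

df[f'{col}:0-20'] = np.where(df[col] <= 20, 1, 0)
df[f'{col}:20-34'] = np.where((df[col] > 20) & (df[col] <= 34), 1, 0)
df[f'{col}:34-50'] = np.where((df[col] > 34) & (df[col] <= 50), 1, 0)
df[f'{col}:50-84'] = np.where((df[col] > 50) & (df[col] <= 84), 1, 0)
# Upper bound of 800 keeps the 999 sentinel out of the '>84' bin.
df[f'{col}:>84'] = np.where((df[col] > 84) & (df[col] <= 800), 1, 0)
df[f'{col}:Missing'] = np.where(df[col] == 999, 1, 0)

# Sanity check: the dummies partition the data, so every row sums to exactly 1.
dummy_cols = [c for c in df.columns if c.startswith(f'{col}:')]
row_sums = df[dummy_cols].sum(axis=1)
```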

Variable: 'percent_bc_gt_75'¶

In [1448]:
df_inputs_prepr['percent_bc_gt_75'].nunique()
Out[1448]:
183
In [1449]:
# 'percent_bc_gt_75'
# We keep observations with 'percent_bc_gt_75' less than or equal to 300.
df_inputs_prepr_temp = df_inputs_prepr.loc[df_inputs_prepr['percent_bc_gt_75'] <= 300, : ]

# Here we do fine-classing: using the 'cut' method, we split the variable into 25 categories by its values.
# df_inputs_prepr_temp
df_inputs_prepr_temp['percent_bc_gt_75_factor'] = pd.cut(df_inputs_prepr_temp['percent_bc_gt_75'], 25)

df_temp = woe_ordered_continuous(df_inputs_prepr_temp, 'percent_bc_gt_75_factor', df_targets_prepr[df_inputs_prepr_temp.index])
# We calculate weight of evidence.
df_temp
Out[1449]:
percent_bc_gt_75_factor n_obs prop_good prop_n_obs n_good n_bad prop_n_good prop_n_bad WoE diff_prop_good diff_WoE IV
0 (-0.1, 4.0] 63770 0.179991 0.232539 11478.0 52292.0 0.195197 0.242731 0.590102 NaN NaN 0.013022
1 (4.0, 8.0] 651 0.239631 0.002374 156.0 495.0 0.002653 0.002298 0.767612 0.059641 0.177511 0.013022
2 (8.0, 12.0] 2922 0.216632 0.010655 633.0 2289.0 0.010765 0.010625 0.699703 0.022999 0.067909 0.013022
3 (12.0, 16.0] 5355 0.205602 0.019527 1101.0 4254.0 0.018724 0.019746 0.666915 0.011030 0.032788 0.013022
4 (16.0, 20.0] 13018 0.200031 0.047470 2604.0 10414.0 0.044284 0.048340 0.650290 0.005572 0.016624 0.013022
5 (20.0, 24.0] 1336 0.203593 0.004872 272.0 1064.0 0.004626 0.004939 0.660924 0.003562 0.010634 0.013022
6 (24.0, 28.0] 12199 0.201574 0.044484 2459.0 9740.0 0.041818 0.045211 0.654899 0.002019 0.006025 0.013022
7 (28.0, 32.0] 3266 0.216473 0.011910 707.0 2559.0 0.012023 0.011878 0.699230 0.014899 0.044330 0.013022
8 (32.0, 36.0] 18131 0.207600 0.066115 3764.0 14367.0 0.064011 0.066689 0.672866 0.008873 0.026364 0.013022
9 (36.0, 40.0] 21265 0.190877 0.077543 4059.0 17206.0 0.069028 0.079867 0.622878 0.016723 0.049988 0.013022
10 (40.0, 44.0] 2460 0.238211 0.008970 586.0 1874.0 0.009966 0.008699 0.763435 0.047334 0.140557 0.013022
11 (44.0, 48.0] 1076 0.266729 0.003924 287.0 789.0 0.004881 0.003662 0.847014 0.028517 0.083579 0.013022
12 (48.0, 52.0] 29228 0.217394 0.106581 6354.0 22874.0 0.108058 0.106177 0.701962 0.049334 0.145052 0.013022
13 (52.0, 56.0] 900 0.257778 0.003282 232.0 668.0 0.003945 0.003101 0.820844 0.040383 0.118882 0.013022
14 (56.0, 60.0] 8036 0.238676 0.029303 1918.0 6118.0 0.032618 0.028399 0.764802 0.019102 0.056042 0.013022
15 (60.0, 64.0] 1194 0.256281 0.004354 306.0 888.0 0.005204 0.004122 0.816464 0.017605 0.051662 0.013022
16 (64.0, 68.0] 17588 0.233057 0.064135 4099.0 13489.0 0.069709 0.062614 0.748256 0.023225 0.068209 0.013022
17 (68.0, 72.0] 1995 0.258145 0.007275 515.0 1480.0 0.008758 0.006870 0.821920 0.025089 0.073664 0.013022
18 (72.0, 76.0] 10247 0.238899 0.037366 2448.0 7799.0 0.041631 0.036202 0.765459 0.019246 0.056461 0.013022
19 (76.0, 80.0] 5941 0.267127 0.021664 1587.0 4354.0 0.026989 0.020211 0.848177 0.028228 0.082718 0.013022
20 (80.0, 84.0] 2908 0.256878 0.010604 747.0 2161.0 0.012704 0.010031 0.818209 0.010249 0.029967 0.013022
21 (84.0, 88.0] 2148 0.281192 0.007833 604.0 1544.0 0.010272 0.007167 0.889209 0.024314 0.070999 0.013022
22 (88.0, 92.0] 710 0.315493 0.002589 224.0 486.0 0.003809 0.002256 0.989025 0.034301 0.099817 0.013022
23 (92.0, 96.0] 70 0.328571 0.000255 23.0 47.0 0.000391 0.000218 1.027069 0.013078 0.038044 0.013022
24 (96.0, 100.0] 47820 0.243392 0.174377 11639.0 36181.0 0.197935 0.167946 0.778666 0.085180 0.248403 0.013022
In [1450]:
plot_by_woe(df_temp, 90)
# We plot the weight of evidence values.
[Figure: weight of evidence by 'percent_bc_gt_75' category]
In [1452]:
# We create the following categories: '0-4', '4-20', '20-40', '40-70', '70-96', '>96'.
# '>96' will be the reference category.
df_inputs_prepr['percent_bc_gt_75:0-4'] = np.where((df_inputs_prepr['percent_bc_gt_75'] <= 4), 1, 0)
df_inputs_prepr['percent_bc_gt_75:4-20'] = np.where((df_inputs_prepr['percent_bc_gt_75'] > 4) & (df_inputs_prepr['percent_bc_gt_75'] <= 20), 1, 0)
df_inputs_prepr['percent_bc_gt_75:20-40'] = np.where((df_inputs_prepr['percent_bc_gt_75'] > 20) & (df_inputs_prepr['percent_bc_gt_75'] <= 40), 1, 0)
df_inputs_prepr['percent_bc_gt_75:40-70'] = np.where((df_inputs_prepr['percent_bc_gt_75'] > 40) & (df_inputs_prepr['percent_bc_gt_75'] <= 70), 1, 0)
df_inputs_prepr['percent_bc_gt_75:70-96'] = np.where((df_inputs_prepr['percent_bc_gt_75'] > 70) & (df_inputs_prepr['percent_bc_gt_75'] <= 96), 1, 0)
df_inputs_prepr['percent_bc_gt_75:>96'] = np.where((df_inputs_prepr['percent_bc_gt_75'] > 96), 1, 0)
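The chain of `np.where` calls above builds one dummy column per hand-picked interval. An equivalent, more compact construction (a sketch on toy values, not the notebook's data) uses `pd.cut` with explicit bin edges plus `pd.get_dummies`:

```python
import pandas as pd

# Toy values standing in for 'percent_bc_gt_75' (assumed range 0-100)
s = pd.Series([0, 3, 15, 35, 60, 90, 100], name='percent_bc_gt_75')

# Same coarse bins as the np.where chain: '0-4', '4-20', '20-40', '40-70', '70-96', '>96'
bins = [-1, 4, 20, 40, 70, 96, 100]
labels = ['0-4', '4-20', '20-40', '40-70', '70-96', '>96']
dummies = pd.get_dummies(pd.cut(s, bins=bins, labels=labels),
                         prefix='percent_bc_gt_75', prefix_sep=':')

# Every observation falls into exactly one dummy column
assert (dummies.sum(axis=1) == 1).all()
```

Unlike the manual `np.where` chain, `pd.cut` guarantees the bins are mutually exclusive and exhaustive over the stated range.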

Variable: 'pub_rec_bankruptcies'¶

In [1453]:
df_inputs_prepr['pub_rec_bankruptcies'].unique()
Out[1453]:
array([ 0.,  1.,  2.,  3.,  6.,  5.,  4.,  7.,  8., 12.])
In [1454]:
# 'pub_rec_bankruptcies'
df_temp = woe_ordered_continuous(df_inputs_prepr, 'pub_rec_bankruptcies', df_targets_prepr)

# We calculate weight of evidence.
df_temp
Out[1454]:
pub_rec_bankruptcies n_obs prop_good prop_n_obs n_good n_bad prop_n_good prop_n_bad WoE diff_prop_good diff_WoE IV
0 0.0 240047 0.211046 0.875336 50661.0 189386.0 0.861552 0.879099 0.683117 NaN NaN 0.001597
1 1.0 32104 0.236170 0.117068 7582.0 24522.0 0.128941 0.113827 0.757427 0.025124 0.074310 0.001597
2 2.0 1611 0.271260 0.005875 437.0 1174.0 0.007432 0.005450 0.860245 0.035090 0.102818 0.001597
3 3.0 346 0.239884 0.001262 83.0 263.0 0.001412 0.001221 0.768357 0.031376 0.091888 0.001597
4 4.0 85 0.282353 0.000310 24.0 61.0 0.000408 0.000283 0.892592 0.042469 0.124235 0.001597
5 5.0 27 0.333333 0.000098 9.0 18.0 0.000153 0.000084 1.040928 0.050980 0.148336 0.001597
6 6.0 9 0.444444 0.000033 4.0 5.0 0.000068 0.000023 1.368881 0.111111 0.327953 0.001597
7 7.0 2 0.500000 0.000007 1.0 1.0 0.000017 0.000005 1.539806 0.055556 0.170925 0.001597
8 8.0 2 0.500000 0.000007 1.0 1.0 0.000017 0.000005 1.539806 0.000000 0.000000 0.001597
9 12.0 1 0.000000 0.000004 0.0 1.0 0.000000 0.000005 0.000000 0.500000 1.539806 0.001597
In [1455]:
plot_by_woe(df_temp, 90)
# We plot the weight of evidence values.
[Figure: weight of evidence by 'pub_rec_bankruptcies' value]
In [1457]:
# We create the following categories: '0', '1-3', '>=4'.
# '>=4' will be the reference category
df_inputs_prepr['pub_rec_bankruptcies:0'] = np.where(df_inputs_prepr['pub_rec_bankruptcies'].isin([0]), 1, 0)
df_inputs_prepr['pub_rec_bankruptcies:1-3'] = np.where(df_inputs_prepr['pub_rec_bankruptcies'].isin(range(1, 4)), 1, 0)
df_inputs_prepr['pub_rec_bankruptcies:>=4'] = np.where(df_inputs_prepr['pub_rec_bankruptcies'].isin(range(4, 100)), 1, 0)
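`woe_ordered_continuous` is defined earlier in the notebook. For reference, the core of what such a helper computes can be sketched as follows — a simplified stand-in using the standard log-odds definition of WoE; the notebook's actual helper produces additional columns (`prop_good`, `diff_WoE`, etc.) and handles interval-typed categories:

```python
import numpy as np
import pandas as pd

def woe_table(x, y):
    """Per-category counts, good/bad shares, WoE (log-odds form), and total IV.

    Simplified stand-in for the notebook's helper; assumes y == 1 marks a good loan.
    """
    df = pd.DataFrame({'x': x, 'y': y})
    g = df.groupby('x')['y'].agg(n_obs='count', n_good='sum')
    g['n_bad'] = g['n_obs'] - g['n_good']
    g['prop_n_good'] = g['n_good'] / g['n_good'].sum()
    g['prop_n_bad'] = g['n_bad'] / g['n_bad'].sum()
    g['WoE'] = np.log(g['prop_n_good'] / g['prop_n_bad'])
    # Information Value: one scalar for the whole variable, repeated per row
    g['IV'] = ((g['prop_n_good'] - g['prop_n_bad']) * g['WoE']).sum()
    return g.reset_index()

# Tiny worked example: category 'b' has 2 goods / 1 bad, 'a' has 1 good / 2 bads
tbl = woe_table(pd.Series(['a', 'a', 'b', 'b', 'b', 'a']),
                pd.Series([1, 0, 1, 1, 0, 0]))
```

Here WoE('b') = ln((2/3)/(1/3)) = ln 2 ≈ 0.693, and the variable's IV ≈ 0.462.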

Variable: 'tot_coll_amt'¶

In [1458]:
df_inputs_prepr['tot_coll_amt'].nunique()
Out[1458]:
6183
In [1459]:
df_inputs_prepr.loc[df_inputs_prepr['tot_coll_amt'] == 0., : ]['tot_coll_amt'].count()
Out[1459]:
234254

There are 234254 rows with value 0. A separate category will be created for 0.

In [1460]:
# 'tot_coll_amt'
# We keep the rows of everyone with a nonzero 'tot_coll_amt' less than or equal to 1000.
df_inputs_prepr_temp = df_inputs_prepr.loc[(df_inputs_prepr['tot_coll_amt'] != 0) & (df_inputs_prepr['tot_coll_amt'] <= 1000), : ]

# Here we do fine-classing: using the 'cut' method, we split the variable into 50 categories by its values.
df_inputs_prepr_temp['tot_coll_amt_factor'] = pd.cut(df_inputs_prepr_temp['tot_coll_amt'], 50)

df_temp = woe_ordered_continuous(df_inputs_prepr_temp, 'tot_coll_amt_factor', df_targets_prepr[df_inputs_prepr_temp.index])
# We calculate weight of evidence.
df_temp
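The slice taken for `df_inputs_prepr_temp` above is a view of `df_inputs_prepr`, which is why assigning the new factor column raises pandas' `SettingWithCopyWarning`. A common remedy (a sketch on toy data) is to take an explicit copy before adding columns:

```python
import pandas as pd

df = pd.DataFrame({'tot_coll_amt': [0, 50, 150, 700, 2500]})  # toy values

# .copy() makes the filtered slice an independent frame, so the later
# column assignment cannot trigger SettingWithCopyWarning.
df_temp2 = df.loc[(df['tot_coll_amt'] != 0) & (df['tot_coll_amt'] <= 1000), :].copy()
df_temp2['tot_coll_amt_factor'] = pd.cut(df_temp2['tot_coll_amt'], 2)
```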
Out[1460]:
tot_coll_amt_factor n_obs prop_good prop_n_obs n_good n_bad prop_n_good prop_n_bad WoE diff_prop_good diff_WoE IV
0 (3.004, 23.92] 22 0.227273 0.000757 5.0 17.0 0.000711 0.000772 0.652650 NaN NaN 0.005679
1 (23.92, 43.84] 198 0.318182 0.006817 63.0 135.0 0.008959 0.006133 0.900455 0.090909 0.247805 0.005679
2 (43.84, 63.76] 2083 0.265002 0.071721 552.0 1531.0 0.078498 0.069556 0.755446 0.053179 0.145009 0.005679
3 (63.76, 83.68] 2599 0.252790 0.089488 657.0 1942.0 0.093430 0.088229 0.722198 0.012213 0.033248 0.005679
4 (83.68, 103.6] 2298 0.247607 0.079124 569.0 1729.0 0.080916 0.078552 0.708084 0.005183 0.014114 0.005679
5 (103.6, 123.52] 1322 0.217852 0.045519 288.0 1034.0 0.040956 0.046977 0.626918 0.029755 0.081166 0.005679
6 (123.52, 143.44] 1267 0.238358 0.043625 302.0 965.0 0.042947 0.043842 0.682885 0.020507 0.055968 0.005679
7 (143.44, 163.36] 1405 0.239858 0.048377 337.0 1068.0 0.047924 0.048521 0.686972 0.001499 0.004086 0.005679
8 (163.36, 183.28] 1091 0.253896 0.037565 277.0 814.0 0.039391 0.036982 0.725209 0.014038 0.038237 0.005679
9 (183.28, 203.2] 1138 0.222320 0.039183 253.0 885.0 0.035978 0.040207 0.639127 0.031576 0.086083 0.005679
10 (203.2, 223.12] 931 0.258861 0.032056 241.0 690.0 0.034272 0.031348 0.738729 0.036542 0.099603 0.005679
11 (223.12, 243.04] 822 0.229927 0.028303 189.0 633.0 0.026877 0.028758 0.659893 0.028934 0.078836 0.005679
12 (243.04, 262.96] 818 0.226161 0.028165 185.0 633.0 0.026308 0.028758 0.649616 0.003766 0.010277 0.005679
13 (262.96, 282.88] 753 0.232404 0.025927 175.0 578.0 0.024886 0.026260 0.666649 0.006242 0.017033 0.005679
14 (282.88, 302.8] 756 0.232804 0.026030 176.0 580.0 0.025028 0.026350 0.667742 0.000401 0.001092 0.005679
15 (302.8, 322.72] 651 0.239631 0.022415 156.0 495.0 0.022184 0.022489 0.686355 0.006827 0.018613 0.005679
16 (322.72, 342.64] 589 0.275042 0.020280 162.0 427.0 0.023038 0.019399 0.782777 0.035411 0.096422 0.005679
17 (342.64, 362.56] 611 0.229133 0.021038 140.0 471.0 0.019909 0.021398 0.657725 0.045910 0.125052 0.005679
18 (362.56, 382.48] 517 0.226306 0.017801 117.0 400.0 0.016638 0.018173 0.650010 0.002827 0.007715 0.005679
19 (382.48, 402.4] 554 0.187726 0.019075 104.0 450.0 0.014790 0.020444 0.544302 0.038580 0.105708 0.005679
20 (402.4, 422.32] 468 0.217949 0.016114 102.0 366.0 0.014505 0.016628 0.627183 0.030223 0.082881 0.005679
21 (422.32, 442.24] 452 0.247788 0.015563 112.0 340.0 0.015927 0.015447 0.708577 0.029839 0.081394 0.005679
22 (442.24, 462.16] 458 0.218341 0.015770 100.0 358.0 0.014221 0.016265 0.628254 0.029447 0.080323 0.005679
23 (462.16, 482.08] 406 0.266010 0.013979 108.0 298.0 0.015358 0.013539 0.758188 0.047669 0.129934 0.005679
24 (482.08, 502.0] 426 0.237089 0.014668 101.0 325.0 0.014363 0.014765 0.679426 0.028921 0.078762 0.005679
25 (502.0, 521.92] 351 0.230769 0.012086 81.0 270.0 0.011519 0.012267 0.662191 0.006320 0.017235 0.005679
26 (521.92, 541.84] 378 0.232804 0.013015 88.0 290.0 0.012514 0.013175 0.667742 0.002035 0.005551 0.005679
27 (541.84, 561.76] 357 0.277311 0.012292 99.0 258.0 0.014078 0.011721 0.788954 0.044507 0.121212 0.005679
28 (561.76, 581.68] 324 0.225309 0.011156 73.0 251.0 0.010381 0.011403 0.647288 0.052002 0.141665 0.005679
29 (581.68, 601.6] 353 0.271955 0.012154 96.0 257.0 0.013652 0.011676 0.774371 0.046646 0.127083 0.005679
30 (601.6, 621.52] 312 0.208333 0.010743 65.0 247.0 0.009243 0.011222 0.600876 0.063621 0.173495 0.005679
31 (621.52, 641.44] 280 0.242857 0.009641 68.0 212.0 0.009670 0.009632 0.695145 0.034524 0.094269 0.005679
32 (641.44, 661.36] 302 0.231788 0.010398 70.0 232.0 0.009954 0.010540 0.664970 0.011069 0.030175 0.005679
33 (661.36, 681.28] 251 0.231076 0.008642 58.0 193.0 0.008248 0.008768 0.663027 0.000712 0.001943 0.005679
34 (681.28, 701.2] 320 0.246875 0.011018 79.0 241.0 0.011234 0.010949 0.706091 0.015799 0.043064 0.005679
35 (701.2, 721.12] 277 0.231047 0.009538 64.0 213.0 0.009101 0.009677 0.662948 0.015828 0.043142 0.005679
36 (721.12, 741.04] 272 0.253676 0.009365 69.0 203.0 0.009812 0.009223 0.724613 0.022630 0.061665 0.005679
37 (741.04, 760.96] 239 0.271967 0.008229 65.0 174.0 0.009243 0.007905 0.774403 0.018290 0.049790 0.005679
38 (760.96, 780.88] 210 0.233333 0.007231 49.0 161.0 0.006968 0.007315 0.669185 0.038633 0.105218 0.005679
39 (780.88, 800.8] 241 0.248963 0.008298 60.0 181.0 0.008532 0.008223 0.711777 0.015629 0.042592 0.005679
40 (800.8, 820.72] 239 0.259414 0.008229 62.0 177.0 0.008817 0.008041 0.740234 0.010452 0.028457 0.005679
41 (820.72, 840.64] 215 0.283721 0.007403 61.0 154.0 0.008675 0.006997 0.806410 0.024307 0.066176 0.005679
42 (840.64, 860.56] 200 0.250000 0.006886 50.0 150.0 0.007110 0.006815 0.714602 0.033721 0.091808 0.005679
43 (860.56, 880.48] 219 0.251142 0.007541 55.0 164.0 0.007821 0.007451 0.717711 0.001142 0.003109 0.005679
44 (880.48, 900.4] 212 0.231132 0.007300 49.0 163.0 0.006968 0.007405 0.663181 0.020009 0.054530 0.005679
45 (900.4, 920.32] 191 0.235602 0.006576 45.0 146.0 0.006399 0.006633 0.675372 0.004470 0.012191 0.005679
46 (920.32, 940.24] 173 0.179191 0.005957 31.0 142.0 0.004408 0.006451 0.520778 0.056411 0.154594 0.005679
47 (940.24, 960.16] 166 0.234940 0.005716 39.0 127.0 0.005546 0.005770 0.673566 0.055749 0.152788 0.005679
48 (960.16, 980.08] 159 0.276730 0.005475 44.0 115.0 0.006257 0.005225 0.787371 0.041790 0.113805 0.005679
49 (980.08, 1000.0] 167 0.245509 0.005750 41.0 126.0 0.005830 0.005724 0.702370 0.031221 0.085001 0.005679
In [1461]:
plot_by_woe(df_temp.iloc[0 : 50, : ], 90)
#plot_by_woe(df_temp, 90)
# We plot the weight of evidence values.
[Figure: weight of evidence by 'tot_coll_amt' bin]
In [1463]:
# We create the following categories: '0', '0-110', '110-300', '300-580', '580-1000', '>1000'.
# '>1000' will be the reference category.
df_inputs_prepr['tot_coll_amt:0'] = np.where((df_inputs_prepr['tot_coll_amt'] == 0), 1, 0)
df_inputs_prepr['tot_coll_amt:0-110'] = np.where((df_inputs_prepr['tot_coll_amt'] > 0) & (df_inputs_prepr['tot_coll_amt'] <= 110), 1, 0)
df_inputs_prepr['tot_coll_amt:110-300'] = np.where((df_inputs_prepr['tot_coll_amt'] > 110) & (df_inputs_prepr['tot_coll_amt'] <= 300), 1, 0)
df_inputs_prepr['tot_coll_amt:300-580'] = np.where((df_inputs_prepr['tot_coll_amt'] > 300) & (df_inputs_prepr['tot_coll_amt'] <= 580), 1, 0)
df_inputs_prepr['tot_coll_amt:580-1000'] = np.where((df_inputs_prepr['tot_coll_amt'] > 580) & (df_inputs_prepr['tot_coll_amt'] <= 1000), 1, 0)
df_inputs_prepr['tot_coll_amt:>1000'] = np.where((df_inputs_prepr['tot_coll_amt'] > 1000), 1, 0)

Variable: 'mort_acc'¶

In [1464]:
df_inputs_prepr['mort_acc'].unique()
Out[1464]:
array([ 0.,  2.,  3.,  6.,  5.,  4.,  1.,  7., 10.,  9.,  8., 14., 11.,
       17., 16., 12., 13., 15., 21., 19., 24., 22., 20., 18., 29., 27.,
       34., 28., 31., 23., 35., 25.])
In [1465]:
# 'mort_acc'
df_temp = woe_ordered_continuous(df_inputs_prepr, 'mort_acc', df_targets_prepr)

# We calculate weight of evidence.
df_temp
Out[1465]:
mort_acc n_obs prop_good prop_n_obs n_good n_bad prop_n_good prop_n_bad WoE diff_prop_good diff_WoE IV
0 0.0 116758 0.246330 0.425760 28761.0 87997.0 0.489116 0.408468 0.787294 NaN NaN inf
1 1.0 46100 0.219176 0.168105 10104.0 35996.0 0.171831 0.167088 0.707242 0.027154 0.080052 inf
2 2.0 38601 0.197016 0.140759 7605.0 30996.0 0.129332 0.143878 0.641275 0.022160 0.065967 inf
3 3.0 28153 0.182005 0.102661 5124.0 23029.0 0.087140 0.106897 0.596183 0.015010 0.045092 inf
4 4.0 19247 0.167662 0.070185 3227.0 16020.0 0.054879 0.074362 0.552733 0.014343 0.043450 inf
5 5.0 11714 0.161687 0.042715 1894.0 9820.0 0.032210 0.045583 0.534515 0.005976 0.018218 inf
6 6.0 6571 0.160554 0.023961 1055.0 5516.0 0.017942 0.025604 0.531053 0.001133 0.003462 inf
7 7.0 3503 0.151870 0.012774 532.0 2971.0 0.009047 0.013791 0.504426 0.008684 0.026627 inf
8 8.0 1656 0.141304 0.006039 234.0 1422.0 0.003979 0.006601 0.471805 0.010565 0.032621 inf
9 9.0 879 0.142207 0.003205 125.0 754.0 0.002126 0.003500 0.474602 0.000903 0.002797 inf
10 10.0 443 0.115124 0.001615 51.0 392.0 0.000867 0.001820 0.389778 0.027083 0.084824 inf
11 11.0 251 0.163347 0.000915 41.0 210.0 0.000697 0.000975 0.539583 0.048222 0.149805 inf
12 12.0 127 0.125984 0.000463 16.0 111.0 0.000272 0.000515 0.424024 0.037362 0.115558 inf
13 13.0 67 0.134328 0.000244 9.0 58.0 0.000153 0.000269 0.450122 0.008344 0.026097 inf
14 14.0 60 0.116667 0.000219 7.0 53.0 0.000119 0.000246 0.394662 0.017662 0.055459 inf
15 15.0 34 0.117647 0.000124 4.0 30.0 0.000068 0.000139 0.397763 0.000980 0.003101 inf
16 16.0 20 0.250000 0.000073 5.0 15.0 0.000085 0.000070 0.798060 0.132353 0.400297 inf
17 17.0 9 0.111111 0.000033 1.0 8.0 0.000017 0.000037 0.377039 0.138889 0.421022 inf
18 18.0 6 0.000000 0.000022 0.0 6.0 0.000000 0.000028 0.000000 0.111111 0.377039 inf
19 19.0 6 0.333333 0.000022 2.0 4.0 0.000034 0.000019 1.040928 0.333333 1.040928 inf
20 20.0 9 0.222222 0.000033 2.0 7.0 0.000034 0.000032 0.716262 0.111111 0.324666 inf
21 21.0 2 0.500000 0.000007 1.0 1.0 0.000017 0.000005 1.539806 0.277778 0.823544 inf
22 22.0 4 0.250000 0.000015 1.0 3.0 0.000017 0.000014 0.798060 0.250000 0.741746 inf
23 23.0 1 0.000000 0.000004 0.0 1.0 0.000000 0.000005 0.000000 0.250000 0.798060 inf
24 24.0 3 0.000000 0.000011 0.0 3.0 0.000000 0.000014 0.000000 0.000000 0.000000 inf
25 25.0 1 0.000000 0.000004 0.0 1.0 0.000000 0.000005 0.000000 0.000000 0.000000 inf
26 27.0 3 0.000000 0.000011 0.0 3.0 0.000000 0.000014 0.000000 0.000000 0.000000 inf
27 28.0 1 0.000000 0.000004 0.0 1.0 0.000000 0.000005 0.000000 0.000000 0.000000 inf
28 29.0 1 1.000000 0.000004 1.0 0.0 0.000017 0.000000 inf 1.000000 inf inf
29 31.0 1 0.000000 0.000004 0.0 1.0 0.000000 0.000005 0.000000 1.000000 inf inf
30 34.0 2 0.000000 0.000007 0.0 2.0 0.000000 0.000009 0.000000 0.000000 0.000000 inf
31 35.0 1 0.000000 0.000004 0.0 1.0 0.000000 0.000005 0.000000 0.000000 0.000000 inf
In [1466]:
plot_by_woe(df_temp.iloc[6 : 50, : ], 90)
#plot_by_woe(df_temp, 90)
# We plot the weight of evidence values.
[Figure: weight of evidence by 'mort_acc' value]
In [1468]:
# We create the following categories: '0', '1', '2', '3-5', '6-12', '13-18', '>=19'.
# '>=19' will be the reference category
df_inputs_prepr['mort_acc:0'] = np.where(df_inputs_prepr['mort_acc'].isin([0]), 1, 0)
df_inputs_prepr['mort_acc:1'] = np.where(df_inputs_prepr['mort_acc'].isin([1]), 1, 0)
df_inputs_prepr['mort_acc:2'] = np.where(df_inputs_prepr['mort_acc'].isin([2]), 1, 0)
df_inputs_prepr['mort_acc:3-5'] = np.where(df_inputs_prepr['mort_acc'].isin(range(3, 6)), 1, 0)
df_inputs_prepr['mort_acc:6-12'] = np.where(df_inputs_prepr['mort_acc'].isin(range(6, 13)), 1, 0)
df_inputs_prepr['mort_acc:13-18'] = np.where(df_inputs_prepr['mort_acc'].isin(range(13, 19)), 1, 0)
df_inputs_prepr['mort_acc:>=19'] = np.where(df_inputs_prepr['mort_acc'].isin(range(19, 200)), 1, 0)
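An aside on the `isin(range(...))` pattern used above: it relies on whole-number floats comparing equal to ints, and on the hard-coded upper bound (here 200) exceeding every observed value. A plain interval comparison avoids both assumptions; the sketch below (toy data) shows the two agree on whole-number inputs:

```python
import numpy as np
import pandas as pd

s = pd.Series([0.0, 3.0, 7.0, 20.0])  # whole-number floats, like 'mort_acc'

# isin(range(...)) matches whole-number floats because 3.0 == 3 ...
by_isin = np.where(s.isin(range(3, 6)), 1, 0)
# ... but an explicit interval comparison needs no arbitrary upper bound
by_cmp = np.where((s >= 3) & (s <= 5), 1, 0)
assert (by_isin == by_cmp).all()
```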

Variable: 'months_since_last_credit_pull'¶

In [1469]:
df_inputs_prepr['months_since_last_credit_pull'].nunique()
Out[1469]:
128
In [1470]:
# 'months_since_last_credit_pull'
# We keep the rows of everyone with 'months_since_last_credit_pull' less than or equal to 1000.
df_inputs_prepr_temp = df_inputs_prepr.loc[(df_inputs_prepr['months_since_last_credit_pull'] <= 1000), : ]

# Here we do fine-classing: using the 'cut' method, we split the variable into 40 categories by its values.
df_inputs_prepr_temp['months_since_last_credit_pull_factor'] = pd.cut(df_inputs_prepr_temp['months_since_last_credit_pull'], 40)

df_temp = woe_ordered_continuous(df_inputs_prepr_temp, 'months_since_last_credit_pull_factor', df_targets_prepr[df_inputs_prepr_temp.index])
# We calculate weight of evidence.
df_temp
Out[1470]:
months_since_last_credit_pull_factor n_obs prop_good prop_n_obs n_good n_bad prop_n_good prop_n_bad WoE diff_prop_good diff_WoE IV
0 (21.857, 25.575] 134899 0.166554 0.491912 22468.0 112431.0 0.382096 0.521886 0.549360 NaN NaN 0.264175
1 (25.575, 29.15] 33208 0.188539 0.121094 6261.0 26947.0 0.106476 0.125084 0.615855 0.021985 0.066495 0.264175
2 (29.15, 32.725] 17636 0.229474 0.064310 4047.0 13589.0 0.068824 0.063078 0.737689 0.040935 0.121834 0.264175
3 (32.725, 36.3] 18135 0.207996 0.066130 3772.0 14363.0 0.064147 0.066671 0.674043 0.021478 0.063646 0.264175
4 (36.3, 39.875] 7034 0.263861 0.025650 1856.0 5178.0 0.031564 0.024035 0.838636 0.055866 0.164593 0.264175
5 (39.875, 43.45] 13869 0.299156 0.050574 4149.0 9720.0 0.070559 0.045119 0.941510 0.035295 0.102874 0.264175
6 (43.45, 47.025] 10270 0.289289 0.037450 2971.0 7299.0 0.050525 0.033881 0.912794 0.009867 0.028716 0.264175
7 (47.025, 50.6] 9154 0.495193 0.033380 4533.0 4621.0 0.077089 0.021450 1.524733 0.205904 0.611939 0.264175
8 (50.6, 54.175] 12921 0.627970 0.047117 8114.0 4807.0 0.137989 0.022313 1.971875 0.132777 0.447142 0.264175
9 (54.175, 57.75] 2569 0.027637 0.009368 71.0 2498.0 0.001207 0.011595 0.099059 0.600333 1.872816 0.264175
10 (57.75, 61.325] 3677 0.034267 0.013408 126.0 3551.0 0.002143 0.016483 0.122216 0.006630 0.023157 0.264175
11 (61.325, 64.9] 1809 0.030404 0.006597 55.0 1754.0 0.000935 0.008142 0.108748 0.003864 0.013468 0.264175
12 (64.9, 68.475] 2061 0.040272 0.007515 83.0 1978.0 0.001412 0.009182 0.143004 0.009868 0.034255 0.264175
13 (68.475, 72.05] 1657 0.036210 0.006042 60.0 1597.0 0.001020 0.007413 0.128961 0.004062 0.014042 0.264175
14 (72.05, 75.625] 890 0.052809 0.003245 47.0 843.0 0.000799 0.003913 0.185867 0.016599 0.056906 0.264175
15 (75.625, 79.2] 1020 0.048039 0.003719 49.0 971.0 0.000833 0.004507 0.169643 0.004770 0.016224 0.264175
16 (79.2, 82.775] 590 0.047458 0.002151 28.0 562.0 0.000476 0.002609 0.167658 0.000582 0.001985 0.264175
17 (82.775, 86.35] 626 0.046326 0.002283 29.0 597.0 0.000493 0.002771 0.163791 0.001132 0.003867 0.264175
18 (86.35, 89.925] 354 0.033898 0.001291 12.0 342.0 0.000204 0.001588 0.120934 0.012428 0.042857 0.264175
19 (89.925, 93.5] 404 0.024752 0.001473 10.0 394.0 0.000170 0.001829 0.088914 0.009146 0.032020 0.264175
20 (93.5, 97.075] 354 0.039548 0.001291 14.0 340.0 0.000238 0.001578 0.140507 0.014796 0.051593 0.264175
21 (97.075, 100.65] 181 0.038674 0.000660 7.0 174.0 0.000119 0.000808 0.137489 0.000874 0.003018 0.264175
22 (100.65, 104.225] 219 0.045662 0.000799 10.0 209.0 0.000170 0.000970 0.161520 0.006988 0.024031 0.264175
23 (104.225, 107.8] 106 0.018868 0.000387 2.0 104.0 0.000034 0.000483 0.068084 0.026794 0.093436 0.264175
24 (107.8, 111.375] 156 0.076923 0.000569 12.0 144.0 0.000204 0.000668 0.266438 0.058055 0.198354 0.264175
25 (111.375, 114.95] 60 0.066667 0.000219 4.0 56.0 0.000068 0.000260 0.232454 0.010256 0.033985 0.264175
26 (114.95, 118.525] 104 0.038462 0.000379 4.0 100.0 0.000068 0.000464 0.136755 0.028205 0.095698 0.264175
27 (118.525, 122.1] 96 0.010417 0.000350 1.0 95.0 0.000017 0.000441 0.037840 0.028045 0.098915 0.264175
28 (122.1, 125.675] 50 0.080000 0.000182 4.0 46.0 0.000068 0.000214 0.276556 0.069583 0.238716 0.264175
29 (125.675, 129.25] 38 0.000000 0.000139 0.0 38.0 0.000000 0.000176 0.000000 0.080000 0.276556 0.264175
30 (129.25, 132.825] 19 0.000000 0.000069 0.0 19.0 0.000000 0.000088 0.000000 0.000000 0.000000 0.264175
31 (132.825, 136.4] 25 0.040000 0.000091 1.0 24.0 0.000017 0.000111 0.142067 0.040000 0.142067 0.264175
32 (136.4, 139.975] 12 0.083333 0.000044 1.0 11.0 0.000017 0.000051 0.287479 0.043333 0.145412 0.264175
33 (139.975, 143.55] 12 0.000000 0.000044 0.0 12.0 0.000000 0.000056 0.000000 0.083333 0.287479 0.264175
34 (143.55, 147.125] 8 0.125000 0.000029 1.0 7.0 0.000017 0.000032 0.420934 0.125000 0.420934 0.264175
35 (147.125, 150.7] 2 0.000000 0.000007 0.0 2.0 0.000000 0.000009 0.000000 0.125000 0.420934 0.264175
36 (150.7, 154.275] 2 0.000000 0.000007 0.0 2.0 0.000000 0.000009 0.000000 0.000000 0.000000 0.264175
37 (154.275, 157.85] 0 NaN 0.000000 NaN NaN NaN NaN NaN NaN NaN 0.264175
38 (157.85, 161.425] 1 0.000000 0.000004 0.0 1.0 0.000000 0.000005 0.000000 NaN NaN 0.264175
39 (161.425, 165.0] 6 0.000000 0.000022 0.0 6.0 0.000000 0.000028 0.000000 0.000000 0.000000 0.264175
In [1471]:
plot_by_woe(df_temp.iloc[ 10: 50, : ], 90)
#plot_by_woe(df_temp, 90)
# We plot the weight of evidence values.
[Figure: weight of evidence by 'months_since_last_credit_pull' bin]
In [1473]:
# We create the following categories: '<=30', '30-48', '48-55', '55-110', '>110'.
# '> 110' will be the reference category.
df_inputs_prepr['months_since_last_credit_pull:<=30'] = np.where((df_inputs_prepr['months_since_last_credit_pull'] <= 30), 1, 0)
df_inputs_prepr['months_since_last_credit_pull:30-48'] = np.where((df_inputs_prepr['months_since_last_credit_pull'] > 30) & (df_inputs_prepr['months_since_last_credit_pull'] <= 48), 1, 0)
df_inputs_prepr['months_since_last_credit_pull:48-55'] = np.where((df_inputs_prepr['months_since_last_credit_pull'] > 48) & (df_inputs_prepr['months_since_last_credit_pull'] <= 55), 1, 0)
df_inputs_prepr['months_since_last_credit_pull:55-110'] = np.where((df_inputs_prepr['months_since_last_credit_pull'] > 55) & (df_inputs_prepr['months_since_last_credit_pull'] <= 110), 1, 0)
df_inputs_prepr['months_since_last_credit_pull:>110'] = np.where((df_inputs_prepr['months_since_last_credit_pull'] > 110), 1, 0)

Variable: 'total_public_records'¶

In [1474]:
df_inputs_prepr['total_public_records'].unique()
Out[1474]:
array([ 0.,  8.,  1.,  2., 14.,  4.,  3.,  9., 10.,  6., 93.,  5., 11.,
        7., 12., 16., 18., 13., 19., 21., 49., 20., 25., 37., 17., 15.,
       22., 79., 88., 24., 41., 31., 29., 23., 91.])
In [1475]:
# 'total_public_records'
df_temp = woe_ordered_continuous(df_inputs_prepr, 'total_public_records', df_targets_prepr)

# We calculate weight of evidence.
df_temp
Out[1475]:
total_public_records n_obs prop_good prop_n_obs n_good n_bad prop_n_good prop_n_bad WoE diff_prop_good diff_WoE IV
0 0.0 227853 0.209284 0.830871 47686.0 180167.0 0.810959 0.836306 0.677877 NaN NaN inf
1 1.0 34418 0.237347 0.125506 8169.0 26249.0 0.138924 0.121844 0.760891 0.028063 0.083014 inf
2 2.0 7028 0.250427 0.025628 1760.0 5268.0 0.029931 0.024453 0.799312 0.013080 0.038421 inf
3 3.0 1674 0.232378 0.006104 389.0 1285.0 0.006615 0.005965 0.746254 0.018049 0.053058 inf
4 4.0 1471 0.235894 0.005364 347.0 1124.0 0.005901 0.005217 0.756614 0.003516 0.010360 inf
5 5.0 472 0.233051 0.001721 110.0 362.0 0.001871 0.001680 0.748239 0.002843 0.008376 inf
6 6.0 546 0.260073 0.001991 142.0 404.0 0.002415 0.001875 0.827560 0.027022 0.079322 inf
7 7.0 157 0.299363 0.000573 47.0 110.0 0.000799 0.000511 0.942112 0.039290 0.114551 inf
8 8.0 215 0.218605 0.000784 47.0 168.0 0.000799 0.000780 0.705550 0.080758 0.236562 inf
9 9.0 81 0.197531 0.000295 16.0 65.0 0.000272 0.000302 0.642817 0.021074 0.062733 inf
10 10.0 112 0.232143 0.000408 26.0 86.0 0.000442 0.000399 0.745562 0.034612 0.102745 inf
11 11.0 27 0.259259 0.000098 7.0 20.0 0.000119 0.000093 0.825179 0.027116 0.079617 inf
12 12.0 69 0.289855 0.000252 20.0 49.0 0.000340 0.000227 0.914442 0.030596 0.089262 inf
13 13.0 16 0.125000 0.000058 2.0 14.0 0.000034 0.000065 0.420934 0.164855 0.493508 inf
14 14.0 20 0.500000 0.000073 10.0 10.0 0.000170 0.000046 1.539806 0.375000 1.118872 inf
15 15.0 10 0.100000 0.000036 1.0 9.0 0.000017 0.000042 0.341514 0.400000 1.198292 inf
16 16.0 13 0.307692 0.000047 4.0 9.0 0.000068 0.000042 0.966339 0.207692 0.624825 inf
17 17.0 2 0.500000 0.000007 1.0 1.0 0.000017 0.000005 1.539806 0.192308 0.573467 inf
18 18.0 8 0.250000 0.000029 2.0 6.0 0.000034 0.000028 0.798060 0.250000 0.741746 inf
19 19.0 6 0.166667 0.000022 1.0 5.0 0.000017 0.000023 0.549702 0.083333 0.248358 inf
20 20.0 12 0.416667 0.000044 5.0 7.0 0.000085 0.000032 1.285622 0.250000 0.735920 inf
21 21.0 3 0.000000 0.000011 0.0 3.0 0.000000 0.000014 0.000000 0.416667 1.285622 inf
22 22.0 5 0.200000 0.000018 1.0 4.0 0.000017 0.000019 0.650199 0.200000 0.650199 inf
23 23.0 1 1.000000 0.000004 1.0 0.0 0.000017 0.000000 inf 0.800000 inf inf
24 24.0 4 0.500000 0.000015 2.0 2.0 0.000034 0.000009 1.539806 0.500000 inf inf
25 25.0 2 0.000000 0.000007 0.0 2.0 0.000000 0.000009 0.000000 0.500000 1.539806 inf
26 29.0 1 1.000000 0.000004 1.0 0.0 0.000017 0.000000 inf 1.000000 inf inf
27 31.0 1 1.000000 0.000004 1.0 0.0 0.000017 0.000000 inf 0.000000 NaN inf
28 37.0 1 1.000000 0.000004 1.0 0.0 0.000017 0.000000 inf 0.000000 NaN inf
29 41.0 1 0.000000 0.000004 0.0 1.0 0.000000 0.000005 0.000000 1.000000 inf inf
30 49.0 1 1.000000 0.000004 1.0 0.0 0.000017 0.000000 inf 1.000000 inf inf
31 79.0 1 1.000000 0.000004 1.0 0.0 0.000017 0.000000 inf 0.000000 NaN inf
32 88.0 1 1.000000 0.000004 1.0 0.0 0.000017 0.000000 inf 0.000000 NaN inf
33 91.0 1 0.000000 0.000004 0.0 1.0 0.000000 0.000005 0.000000 1.000000 inf inf
34 93.0 1 0.000000 0.000004 0.0 1.0 0.000000 0.000005 0.000000 0.000000 0.000000 inf
In [1476]:
plot_by_woe(df_temp.iloc[: 30, : ], 90)
# We plot the weight of evidence values.
[Figure: weight of evidence by 'total_public_records' value]
In [1478]:
# Categories
# '0', '1-3', '4-12', '>=13'
df_inputs_prepr['total_public_records:0'] = np.where((df_inputs_prepr['total_public_records'] == 0), 1, 0)
df_inputs_prepr['total_public_records:1-3'] = np.where((df_inputs_prepr['total_public_records'] >= 1) & (df_inputs_prepr['total_public_records'] <= 3), 1, 0)
df_inputs_prepr['total_public_records:4-12'] = np.where((df_inputs_prepr['total_public_records'] >= 4) & (df_inputs_prepr['total_public_records'] <= 12), 1, 0)
df_inputs_prepr['total_public_records:>=13'] = np.where((df_inputs_prepr['total_public_records'] >= 13), 1, 0)
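Because the four bins above are meant to partition the variable, a cheap sanity check (a sketch on toy values) is that each row's dummies sum to exactly 1:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'total_public_records': [0, 2, 7, 15]})  # toy values

cats = {
    'total_public_records:0':    df['total_public_records'] == 0,
    'total_public_records:1-3':  df['total_public_records'].between(1, 3),
    'total_public_records:4-12': df['total_public_records'].between(4, 12),
    'total_public_records:>=13': df['total_public_records'] >= 13,
}
for name, mask in cats.items():
    df[name] = np.where(mask, 1, 0)

# Exhaustive and mutually exclusive bins => dummies sum to 1 per row
assert (df[list(cats)].sum(axis=1) == 1).all()
```

The same check would have caught the copy-paste slip of binning on the wrong source column.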

D. Final list of features to consider in the credit risk model.¶

In [1479]:
Final_list_features = ['grade:A', 'grade:B', 'grade:C', 'grade:D', 'grade:E', 'grade:F',
       'grade:G', 'home_ownership:MORTGAGE', 'home_ownership:OWN',
       'home_ownership:RENT_OTHER_NONE_ANY',
       'addr_state:ND_NE_IA_NV_FL_HI_AL', 'addr_state:NM_VA',
       'addr_state:OK_TN_MO_LA_MD_NC', 'addr_state:UT_KY_AZ_NJ',
       'addr_state:AR_MI_PA_OH_MN', 'addr_state:RI_MA_DE_SD_IN',
       'addr_state:GA_WA_OR', 'addr_state:WI_MT', 'addr_state:IL_CT',
       'addr_state:KS_SC_CO_VT_AK_MS', 'addr_state:WV_NH_WY_DC_ME_ID',
       'verification_status:Not Verified',
       'verification_status:Source Verified',
       'verification_status:Verified', 'purpose:debt_consolidation',
       'purpose:credit_card', 'purpose:sm_b__mov__ren_en__house__medic',
       'purpose:other__vacat__maj_purch', 'purpose:home_impr__educ__car__wed',
       'initial_list_status:f', 'initial_list_status:w',
       'application_type:Individual', 'application_type:Joint App',
       'hardship_flag:N', 'hardship_flag:Y', 'disbursement_method:Cash',
       'disbursement_method:DirectPay', 'debt_settlement_flag:N',
       'debt_settlement_flag:Y', 'term:36', 'term:60', 'num_tl_120dpd_2m:0', 'num_tl_120dpd_2m:1',
       'num_tl_120dpd_2m:2-6', 'num_tl_30dpd:0', 'num_tl_30dpd:1',
       'num_tl_30dpd:2-4', 'delinq_record_risk_score:0',
       'delinq_record_risk_score:1-2', 'delinq_record_risk_score:3-4',
       'delinq_record_risk_score:5-7', 'log_annual_inc:<20K',
       'annual_inc:20K-30K', 'annual_inc:30K-40K', 'annual_inc:40K-50K',
       'annual_inc:50K-60K', 'annual_inc:60K-70K', 'annual_inc:70K-80K',
       'annual_inc:80K-90K', 'annual_inc:90K-100K',
       'annual_inc:100K-120K', 'annual_inc:120K-140K', 'annual_inc:>140K',
       'loan_amnt:<2500', 'loan_amnt:2500-6500', 'loan_amnt:6500-9500',
       'loan_amnt:9500-11000', 'loan_amnt:11000-17500',
       'loan_amnt:17500-28500', 'loan_amnt:>=28500', 'int_rate:<=8',
       'int_rate:8-12.5', 'int_rate:12.5-16.5', 'int_rate:16.5-20',
       'int_rate:20-23.5', 'int_rate:>23.5', 'emp_length_int:0',
       'emp_length_int:1', 'emp_length_int:2-4', 'emp_length_int:5-7',
       'emp_length_int:8-9', 'emp_length_int:10', 'dti:<=10', 'dti:10-20',
       'dti:20-30', 'dti:30-40', 'dti:>40', 'min_mths_since_delinquency',
       'min_mths_since_delinquency:Missing',
       'min_mths_since_delinquency:<=20',
       'min_mths_since_delinquency:20-40',
       'min_mths_since_delinquency:40-80',
       'min_mths_since_delinquency:>80',
       'mths_since_earliest_cr_line:<=120',
       'mths_since_earliest_cr_line:121-200',
       'mths_since_earliest_cr_line:201-260',
       'mths_since_earliest_cr_line:261-320',
       'mths_since_earliest_cr_line:321-400',
       'mths_since_earliest_cr_line:401-600',
       'mths_since_earliest_cr_line:>=601', 'delinq_2yrs:0',
       'delinq_2yrs:1', 'delinq_2yrs:2-9', 'delinq_2yrs:>=10',
       'inq_last_6mths:0', 'inq_last_6mths:1-2', 'inq_last_6mths:3-5',
       'inq_last_6mths:>=6', 'collections_12_mths_ex_med:0',
       'collections_12_mths_ex_med:1', 'collections_12_mths_ex_med:>=2',
       'chargeoff_within_12_mths:0', 'chargeoff_within_12_mths:1',
       'chargeoff_within_12_mths:>=2', 'total_acc:<=20',
       'total_acc:21-56', 'total_acc:>=57', 'delinq_amnt:0',
       'delinq_amnt:>=1', 'num_accts_ever_120_pd:0',
       'num_accts_ever_120_pd:1-11', 'num_accts_ever_120_pd:>=12',
       'num_tl_90g_dpd_24m:0', 'num_tl_90g_dpd_24m:1-4',
       'num_tl_90g_dpd_24m:>=5', 'revol_bal:<=8k', 'revol_bal:8-22k',
       'revol_bal:22-35k', 'revol_bal:35-60k', 'revol_bal:60-100k',
       'revol_bal:>100k', 'total_bal_il:=0', 'total_bal_il:0-18k',
       'total_bal_il:18-30k', 'total_bal_il:30-70k',
       'total_bal_il:70-200k', 'total_bal_il:>200k', 'max_bal_bc:=0',
       'max_bal_bc:0-8k', 'max_bal_bc:8-16k', 'max_bal_bc:16-26k',
       'max_bal_bc:26-50k', 'max_bal_bc:>50k', 'avg_cur_bal:0-7k',
       'avg_cur_bal:7-15k', 'avg_cur_bal:15-30k', 'avg_cur_bal:30-50k',
       'avg_cur_bal:50-100k', 'avg_cur_bal:>100k', 'bc_open_to_buy:0-5k',
       'bc_open_to_buy:5-15k', 'bc_open_to_buy:15-30k',
       'bc_open_to_buy:30-50k', 'bc_open_to_buy:50-100k',
       'bc_open_to_buy:>100k', 'revol_bal_to_bc_limit:0-0.6',
       'revol_bal_to_bc_limit:0.6-1.2', 'revol_bal_to_bc_limit:1.2-3.6',
       'revol_bal_to_bc_limit:3.6-5.5', 'revol_bal_to_bc_limit:5.5-10.',
       'revol_bal_to_bc_limit:>10.', 'revol_bal_to_open_to_buy:0-2',
       'revol_bal_to_open_to_buy:2-4', 'revol_bal_to_open_to_buy:4-20',
       'revol_bal_to_open_to_buy:20-100', 'revol_bal_to_open_to_buy:>100',
       'total_bal_ex_mort_to_inc:0-0.4', 'total_bal_ex_mort_to_inc:0.4-1',
       'total_bal_ex_mort_to_inc:1-2.6', 'total_bal_ex_mort_to_inc:2.6-4.4',
       'total_bal_ex_mort_to_inc:>4.4', 'total_balance_to_credit_ratio:0-0.05',
       'total_balance_to_credit_ratio:0.05-0.2', 'total_balance_to_credit_ratio:0.2-0.4',
       'total_balance_to_credit_ratio:0.4-0.7', 'total_balance_to_credit_ratio:0.7-1',
       'total_balance_to_credit_ratio:1-1.4', 'total_balance_to_credit_ratio:>1.4',
       'rev_to_il_limit_ratio:0-0.6', 'rev_to_il_limit_ratio:0.6-0.8',
       'rev_to_il_limit_ratio:0.8-1.8', 'rev_to_il_limit_ratio:1.8-4.5',
       'rev_to_il_limit_ratio:4.5-10', 'rev_to_il_limit_ratio:>10.',
       'total_il_high_credit_limit:0-5k', 'total_il_high_credit_limit:5-10k',
       'total_il_high_credit_limit:10-30k', 'total_il_high_credit_limit:30-35k',
       'total_il_high_credit_limit:35-100k', 'total_il_high_credit_limit:>100k', 
       'tot_cur_bal:0-20k', 'tot_cur_bal:20-70k', 'tot_cur_bal:70-80k', 
       'tot_cur_bal:80-130k', 'tot_cur_bal:130-200k', 'tot_cur_bal:200-250k',
       'tot_cur_bal:250-500k', 'tot_cur_bal:>500k', 'open_act_il:0',
       'open_act_il:1-5', 'open_act_il:6-15', 'open_act_il:>=16',
       'open_il_12m:0', 'open_il_12m:1-5', 'open_il_12m:>=6',
       'num_actv_rev_tl:0', 'num_actv_rev_tl:1-5', 'num_actv_rev_tl:6-9',
       'num_actv_rev_tl:10-13', 'num_actv_rev_tl:14-17',
       'num_actv_rev_tl:18-26', 'num_actv_rev_tl:>=27', 'open_rv_12m:0',
       'open_rv_12m:1-2', 'open_rv_12m:3-5', 'open_rv_12m:6-8',
       'open_rv_12m:9-13', 'open_rv_12m:>=14', 'num_bc_tl:0',
       'num_bc_tl:1-5', 'num_bc_tl:6-10', 'num_bc_tl:11-20',
       'num_bc_tl:21-32', 'num_bc_tl:>=33', 'open_acc_6m:0',
       'open_acc_6m:1-3', 'open_acc_6m:4-7', 'open_acc_6m:>=8',
       'acc_open_past_24mths:0-3', 'acc_open_past_24mths:4-7',
       'acc_open_past_24mths:8-13', 'acc_open_past_24mths:14-21',
       'acc_open_past_24mths:>=22', 'total_cu_tl:0', 'total_cu_tl:1-7',
       'total_cu_tl:8-17', 'total_cu_tl:>=18', 'inq_last_12m:0',
       'inq_last_12m:1-4', 'inq_last_12m:5-9', 'inq_last_12m:10-16',
       'inq_last_12m:>=17', 'mths_since_recent_inq:Missing',
       'mths_since_recent_inq:0-1', 'mths_since_recent_inq:2-3',
       'mths_since_recent_inq:4-6', 'mths_since_recent_inq:7-10',
       'mths_since_recent_inq:11-15', 'mths_since_recent_inq:>=16',
       'out_prncp:=0', 'out_prncp:>0', 'last_pymnt_amnt:<=200',
       'last_pymnt_amnt:200-700', 'last_pymnt_amnt:700-1000',
       'last_pymnt_amnt:1000-1500', 'last_pymnt_amnt:1500-2600',
       'last_pymnt_amnt:2600-10000', 'last_pymnt_amnt:>10000',
       'principal_paid_ratio:<=0.3', 'principal_paid_ratio:0.3-0.45',
       'principal_paid_ratio:0.45-0.6', 'principal_paid_ratio:0.6-1',
       'principal_paid_ratio:=1', 'fico_range_high:<=680',
       'fico_range_high:680-700', 'fico_range_high:700-720',
       'fico_range_high:720-750', 'fico_range_high:750-795',
       'fico_range_high:>795', 'last_fico_range_high:<=520',
       'last_fico_range_high:520-550', 'last_fico_range_high:550-580',
       'last_fico_range_high:580-610', 'last_fico_range_high:610-640',
       'last_fico_range_high:640-670', 'last_fico_range_high:>670',
       'mo_sin_rcnt_rev_tl_op:0-3', 'mo_sin_rcnt_rev_tl_op:3-6',
       'mo_sin_rcnt_rev_tl_op:6-9', 'mo_sin_rcnt_rev_tl_op:9-20',
       'mo_sin_rcnt_rev_tl_op:20-37', 'mo_sin_rcnt_rev_tl_op:37-63',
       'mo_sin_rcnt_rev_tl_op:63-80', 'mo_sin_rcnt_rev_tl_op:80-140',
       'mo_sin_rcnt_rev_tl_op:>140', 'mo_sin_rcnt_tl:0-2',
       'mo_sin_rcnt_tl:2-5', 'mo_sin_rcnt_tl:5-6', 'mo_sin_rcnt_tl:6-10',
       'mo_sin_rcnt_tl:10-15', 'mo_sin_rcnt_tl:15-20',
       'mo_sin_rcnt_tl:20-50', 'mo_sin_rcnt_tl:>50',
       'mths_since_rcnt_il:0-4', 'mths_since_rcnt_il:4-10',
       'mths_since_rcnt_il:10-20', 'mths_since_rcnt_il:20-40',
       'mths_since_rcnt_il:40-100', 'mths_since_rcnt_il:>100',
       'mths_since_recent_bc:0-12', 'mths_since_recent_bc:12-32',
       'mths_since_recent_bc:32-52', 'mths_since_recent_bc:52-68',
       'mths_since_recent_bc:68-100', 'mths_since_recent_bc:100-130',
       'mths_since_recent_bc:>130', 'mths_since_rcnt_il:Missing',
       'mths_since_recent_revol_delinq:0-20', 'mths_since_recent_revol_delinq:20-34',
       'mths_since_recent_revol_delinq:34-50', 'mths_since_recent_revol_delinq:50-84',
       'mths_since_recent_revol_delinq:>84', 'mths_since_recent_revol_delinq:Missing', 
       'percent_bc_gt_75:0-4', 'percent_bc_gt_75:4-20', 'percent_bc_gt_75:20-40',
       'percent_bc_gt_75:40-70', 'percent_bc_gt_75:70-96',
       'percent_bc_gt_75:>96', 'pub_rec_bankruptcies:0',
       'pub_rec_bankruptcies:1-3', 'pub_rec_bankruptcies:>4',
       'tot_coll_amt:0', 'tot_coll_amt:0-110', 'tot_coll_amt:110-300',
       'tot_coll_amt:300-580', 'tot_coll_amt:580-1000',
       'tot_coll_amt:>1000', 'mort_acc:0', 'mort_acc:1', 'mort_acc:2',
       'mort_acc:3-5', 'mort_acc:6-12', 'mort_acc:13-18', 'mort_acc:>=19',
       'months_since_last_credit_pull:<=30', 'months_since_last_credit_pull:30-48',
       'months_since_last_credit_pull:48-55', 'months_since_last_credit_pull:55-110',
       'months_since_last_credit_pull:>110', 'total_public_records:0',
       'total_public_records:1-3', 'total_public_records:4-12',
       'total_public_records:>=13']

len(Final_list_features)
Out[1479]:
344

Feature Selection Process:¶

The final set of features selected for training the credit risk model consists of 344 variables. This refined list was obtained after a systematic feature selection process that involved removing variables with excessive missing values, eliminating features exhibiting high multicollinearity based on VIF analysis, and transforming relevant continuous variables into categorical formats using Weight of Evidence (WoE) encoding guided by both predictive power and variable distribution. This approach ensures that the retained features contribute meaningful, non-redundant information to the model while supporting better interpretability and generalization.
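The WoE encoding mentioned above can be illustrated with a minimal sketch. `woe_table` is a hypothetical helper, not the notebook's actual code, and it assumes a binary target where 1 means a good (non-defaulted) loan:

```python
import numpy as np
import pandas as pd

def woe_table(df, feature, target):
    """Weight of Evidence per category of `feature`.

    Convention: target == 1 means a 'good' (non-defaulted) loan.
    WoE = ln(share of all goods in the category / share of all bads);
    IV sums (pct_good - pct_bad) * WoE over all categories.
    """
    g = df.groupby(feature)[target].agg(n='count', n_good='sum')
    g['n_bad'] = g['n'] - g['n_good']
    g['pct_good'] = g['n_good'] / g['n_good'].sum()
    g['pct_bad'] = g['n_bad'] / g['n_bad'].sum()
    g['WoE'] = np.log(g['pct_good'] / g['pct_bad'])
    g['IV'] = ((g['pct_good'] - g['pct_bad']) * g['WoE']).sum()
    return g
```

Categories with similar WoE values are merged into a single dummy (which is how grouped labels such as `addr_state:NM_VA` arise), and features with negligible Information Value can be dropped entirely.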

The following original features have already been converted into categorical dummy variables, so they will not be used for training or testing the model; they are dropped from the dataframe.¶

In [1480]:
features_to_drop = ['loan_amnt', 'int_rate', 'grade', 'home_ownership', 'annual_inc',
       'verification_status', 'purpose', 'addr_state', 'dti',
       'delinq_2yrs', 'fico_range_high', 'inq_last_6mths', 'revol_bal',
       'total_acc', 'initial_list_status', 'out_prncp', 'last_pymnt_amnt',
       'last_fico_range_high', 'collections_12_mths_ex_med',
       'application_type', 'tot_coll_amt', 'tot_cur_bal', 'open_acc_6m',
       'open_act_il', 'open_il_12m', 'mths_since_rcnt_il', 'total_bal_il',
       'open_rv_12m', 'max_bal_bc', 'total_rev_hi_lim', 'inq_fi',
       'total_cu_tl', 'inq_last_12m', 'acc_open_past_24mths',
       'avg_cur_bal', 'bc_open_to_buy', 'chargeoff_within_12_mths',
       'delinq_amnt', 'mo_sin_rcnt_rev_tl_op', 'mo_sin_rcnt_tl',
       'mort_acc', 'mths_since_recent_bc', 'mths_since_recent_inq',
       'mths_since_recent_revol_delinq', 'num_accts_ever_120_pd',
       'num_actv_rev_tl', 'num_bc_tl', 'num_tl_120dpd_2m', 'num_tl_30dpd',
       'num_tl_90g_dpd_24m', 'num_tl_op_past_12m', 'percent_bc_gt_75',
       'pub_rec_bankruptcies', 'hardship_flag', 'disbursement_method',
       'debt_settlement_flag', 'emp_length_int', 'term_int',
       'mths_since_earliest_cr_line', 'months_since_last_credit_pull', 
       'delinq_record_risk_score', 'revol_bal_to_bc_limit', 'revol_bal_to_open_to_buy',
       'total_bal_ex_mort_to_inc', 'total_balance_to_credit_ratio', 'rev_to_il_limit_ratio', 
       'principal_paid_ratio', 'total_public_records', 'total_il_high_credit_limit']

len(features_to_drop)
Out[1480]:
69
In [1481]:
# Drop the original features.
df_inputs_prepr_copy = df_inputs_prepr.copy()
df_inputs_prepr = df_inputs_prepr.drop(columns = features_to_drop)
In [1482]:
df_inputs_prepr.shape[1]
Out[1482]:
344
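A quick sanity check at this point is to confirm that the columns remaining after the drop line up exactly with `Final_list_features`, not merely that the counts match. The helper below is a hypothetical sketch:

```python
import pandas as pd

def check_feature_alignment(df, expected_features):
    """Compare DataFrame columns against an expected feature list.

    Returns (missing, unexpected): features expected but absent from df,
    and columns present in df but not in the expected list.
    """
    cols = set(df.columns)
    expected = set(expected_features)
    return expected - cols, cols - expected
```

Here, `check_feature_alignment(df_inputs_prepr, Final_list_features)` should return two empty sets; any mismatch would surface a misspelled dummy name before the data is exported.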

E. Save the processed train and test datasets to CSV files.¶

In [1484]:
#####
# Uncomment the appropriate line for the dataset being processed: the same
# preprocessing pipeline is run once on the training set and once on the test set.
# loan_data_inputs_train = df_inputs_prepr.copy()
#####
loan_data_inputs_test = df_inputs_prepr.copy()
In [1485]:
# save and export datasets in csv format.
loan_data_inputs_train.to_csv('C:/Disc D/365DataScience/Credit risk modeling/loan_data_inputs_train.csv')
loan_data_targets_train.to_csv('C:/Disc D/365DataScience/Credit risk modeling/loan_data_targets_train.csv')
loan_data_inputs_test.to_csv('C:/Disc D/365DataScience/Credit risk modeling/loan_data_inputs_test.csv')
loan_data_targets_test.to_csv('C:/Disc D/365DataScience/Credit risk modeling/loan_data_targets_test.csv')
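Since `to_csv` writes the DataFrame index by default, reading these files back in later stages should use `index_col=0`; otherwise the saved index reappears as an extra `Unnamed: 0` column. A minimal round-trip sketch (using an in-memory buffer rather than the file paths above):

```python
import io
import pandas as pd

# Round-trip through CSV: the index is written out, so index_col=0
# recovers the original frame instead of adding an 'Unnamed: 0' column.
df = pd.DataFrame({'grade:A': [1, 0], 'grade:B': [0, 1]}, index=[10, 11])
buf = io.StringIO()
df.to_csv(buf)
buf.seek(0)
restored = pd.read_csv(buf, index_col=0)
assert restored.equals(df)
```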
In [1486]:
# shape of train dataset.
loan_data_inputs_train.shape
Out[1486]:
(1096932, 344)
In [1490]:
# shape of test dataset.
loan_data_inputs_test.shape
Out[1490]:
(274234, 344)

Conclusion:¶

  • After completing the data cleaning and preparation steps, the dataset was refined and structured for modeling. This included imputing or removing missing values, eliminating features with excessively high proportions of missing data, and constructing a reliable target variable.

  • The dataset was then split into training and testing sets based on temporal criteria to simulate real-world prediction scenarios. Categorical and continuous variables were processed using Weight of Evidence (WoE) encoding and multicollinearity analysis (VIF), ensuring interpretability and statistical soundness.

  • Feature engineering techniques were applied to create new informative variables and enhance the predictive power of the dataset.

  • The resulting processed datasets were exported and saved in CSV format for use in the subsequent stages of credit risk modeling.
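The VIF-based multicollinearity screening summarized above can be sketched with plain NumPy. This is a minimal illustration under stated assumptions, not the notebook's actual implementation (which may rely on statsmodels' `variance_inflation_factor`):

```python
import numpy as np

def vif(X):
    """Variance Inflation Factor for each column of a 2-D array X (n x p).

    VIF_j = 1 / (1 - R^2_j), where R^2_j comes from regressing column j
    on the remaining columns plus an intercept. Values near 1 indicate
    no collinearity; large values flag redundant features.
    """
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    out = np.empty(p)
    for j in range(p):
        y = X[:, j]
        A = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ beta
        ss_res = resid @ resid
        ss_tot = ((y - y.mean()) ** 2).sum()
        # Exact collinearity drives ss_res to zero -> infinite VIF.
        out[j] = np.inf if ss_res == 0 else ss_tot / ss_res
    return out
```

Features whose VIF exceeds a chosen threshold (commonly 5 or 10) are candidates for removal, which is how the redundant continuous variables were screened out before the WoE binning step.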
